Hi folks,
Could I please have some timings for these Y=2^x algos?
Thanks, JJ
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 79/20 cycles
913 cycles for 20 * Pow2 fist/fild
907 cycles for 20 * Pow2 fadd One
1500 cycles for 20 * Pow2 frndint
Updated:
Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)
loop overhead is approx. 30/20 cycles
1037 cycles for 20 * Pow2 fist/fild
1039 cycles for 20 * Pow2 fadd One
1183 cycles for 20 * Pow2 frndint
2301 cycles for 20 * PowX fist/fild
2429 cycles for 20 * PowX frndint
1071 cycles for 20 * Pow2 fist/fild
1046 cycles for 20 * Pow2 fadd One
1153 cycles for 20 * Pow2 frndint
2544 cycles for 20 * PowX fist/fild
2740 cycles for 20 * PowX frndint
1343 cycles for 20 * Pow2 fist/fild
1202 cycles for 20 * Pow2 fadd One
1448 cycles for 20 * Pow2 frndint
2600 cycles for 20 * PowX fist/fild
2438 cycles for 20 * PowX frndint
1295 cycles for 20 * Pow2 fist/fild
1322 cycles for 20 * Pow2 fadd One
1452 cycles for 20 * Pow2 frndint
2775 cycles for 20 * PowX fist/fild
2624 cycles for 20 * PowX frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
36 bytes for Pow2 frndint
50 bytes for PowX fist/fild
47 bytes for PowX frndint
--- ok ---
Jochen,
here are my results:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 77/20 cycles
1036 cycles for 20 * Pow2 fist/fild
1053 cycles for 20 * Pow2 fadd One
24582 cycles for 20 * Pow2 frndint
1027 cycles for 20 * Pow2 fist/fild
1054 cycles for 20 * Pow2 fadd One
24499 cycles for 20 * Pow2 frndint
1026 cycles for 20 * Pow2 fist/fild
1351 cycles for 20 * Pow2 fadd One
24550 cycles for 20 * Pow2 frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
34 bytes for Pow2 frndint
--- ok ---
By the way: well done. :t
Gunther
Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz (SSE4)
loop overhead is approx. 68/20 cycles
2296 cycles for 20 * Pow2 fist/fild
2324 cycles for 20 * Pow2 fadd One
14751 cycles for 20 * Pow2 frndint
2298 cycles for 20 * Pow2 fist/fild
2300 cycles for 20 * Pow2 fadd One
14857 cycles for 20 * Pow2 frndint
2325 cycles for 20 * Pow2 fist/fild
2295 cycles for 20 * Pow2 fadd One
14731 cycles for 20 * Pow2 frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
34 bytes for Pow2 frndint
Hi,
P-III, others if wanted.
pre-P4 (SSE1)
loop overhead is approx. 48/20 cycles
2665 cycles for 20 * Pow2 fist/fild
2654 cycles for 20 * Pow2 fadd One
8635 cycles for 20 * Pow2 frndint
2664 cycles for 20 * Pow2 fist/fild
2655 cycles for 20 * Pow2 fadd One
8635 cycles for 20 * Pow2 frndint
2664 cycles for 20 * Pow2 fist/fild
2650 cycles for 20 * Pow2 fadd One
8635 cycles for 20 * Pow2 frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
34 bytes for Pow2 frndint
--- ok ---
Regards,
Steve N.
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 51/20 cycles
5718 cycles for 20 * Pow2 fist/fild
5759 cycles for 20 * Pow2 fadd One
78540 cycles for 20 * Pow2 frndint
5740 cycles for 20 * Pow2 fist/fild
5725 cycles for 20 * Pow2 fadd One
78548 cycles for 20 * Pow2 frndint
5732 cycles for 20 * Pow2 fist/fild
5716 cycles for 20 * Pow2 fadd One
79000 cycles for 20 * Pow2 frndint
i don't know what's in frndint, but it doesn't like P4's - lol
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 75/20 cycles
895 cycles for 20 * Pow2 fist/fild
890 cycles for 20 * Pow2 fadd One
1465 cycles for 20 * Pow2 frndint
Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)
loop overhead is approx. 39/20 cycles
1254 cycles for 20 * Pow2 fist/fild
1279 cycles for 20 * Pow2 fadd One
28154 cycles for 20 * Pow2 frndint
Quote from: dedndave on April 11, 2013, 12:26:50 AM
i don't know what's in frndint, but it doesn't like P4's - lol
Could be related to the fact that I forgot one fld st in that algo :redface:
New version attached on top of this thread. My apologies for having wasted your time. To compensate, I added z=x^y to the list - see PowX.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
912 cycles for 20 * Pow2 fist/fild
980 cycles for 20 * Pow2 fadd One
1041 cycles for 20 * Pow2 frndint
4059 cycles for 20 * PowX fist/fild
3982 cycles for 20 * PowX frndint
841 cycles for 20 * Pow2 fist/fild
906 cycles for 20 * Pow2 fadd One
1039 cycles for 20 * Pow2 frndint
4054 cycles for 20 * PowX fist/fild
4065 cycles for 20 * PowX frndint
much better :biggrin:
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 47/20 cycles
5724 cycles for 20 * Pow2 fist/fild
5745 cycles for 20 * Pow2 fadd One
5932 cycles for 20 * Pow2 frndint
8048 cycles for 20 * PowX fist/fild
8082 cycles for 20 * PowX frndint
5739 cycles for 20 * Pow2 fist/fild
5715 cycles for 20 * Pow2 fadd One
5947 cycles for 20 * Pow2 frndint
8371 cycles for 20 * PowX fist/fild
8105 cycles for 20 * PowX frndint
5737 cycles for 20 * Pow2 fist/fild
5759 cycles for 20 * Pow2 fadd One
5924 cycles for 20 * Pow2 frndint
8060 cycles for 20 * PowX fist/fild
8094 cycles for 20 * PowX frndint
5748 cycles for 20 * Pow2 fist/fild
5736 cycles for 20 * Pow2 fadd One
5930 cycles for 20 * Pow2 frndint
8021 cycles for 20 * PowX fist/fild
8116 cycles for 20 * PowX frndint
Jochen,
never mind; no need for excuses. Here are the new timings:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 78/20 cycles
1481 cycles for 20 * Pow2 fist/fild
1505 cycles for 20 * Pow2 fadd One
1479 cycles for 20 * Pow2 frndint
2755 cycles for 20 * PowX fist/fild
2820 cycles for 20 * PowX frndint
1325 cycles for 20 * Pow2 fist/fild
1356 cycles for 20 * Pow2 fadd One
1636 cycles for 20 * Pow2 frndint
2774 cycles for 20 * PowX fist/fild
2810 cycles for 20 * PowX frndint
1325 cycles for 20 * Pow2 fist/fild
1352 cycles for 20 * Pow2 fadd One
1644 cycles for 20 * Pow2 frndint
2595 cycles for 20 * PowX fist/fild
2794 cycles for 20 * PowX frndint
1494 cycles for 20 * Pow2 fist/fild
1509 cycles for 20 * Pow2 fadd One
1180 cycles for 20 * Pow2 frndint
2591 cycles for 20 * PowX fist/fild
2803 cycles for 20 * PowX frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
36 bytes for Pow2 frndint
50 bytes for PowX fist/fild
47 bytes for PowX frndint
--- ok ---
Gunther
P-III
pre-P4 (SSE1)
loop overhead is approx. 48/20 cycles
2665 cycles for 20 * Pow2 fist/fild
2652 cycles for 20 * Pow2 fadd One
2717 cycles for 20 * Pow2 frndint
4829 cycles for 20 * PowX fist/fild
5023 cycles for 20 * PowX frndint
2664 cycles for 20 * Pow2 fist/fild
2658 cycles for 20 * Pow2 fadd One
2711 cycles for 20 * Pow2 frndint
4833 cycles for 20 * PowX fist/fild
5017 cycles for 20 * PowX frndint
2670 cycles for 20 * Pow2 fist/fild
2652 cycles for 20 * Pow2 fadd One
2712 cycles for 20 * Pow2 frndint
4835 cycles for 20 * PowX fist/fild
5028 cycles for 20 * PowX frndint
2664 cycles for 20 * Pow2 fist/fild
2653 cycles for 20 * Pow2 fadd One
2711 cycles for 20 * Pow2 frndint
4837 cycles for 20 * PowX fist/fild
5026 cycles for 20 * PowX frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
36 bytes for Pow2 frndint
50 bytes for PowX fist/fild
47 bytes for PowX frndint
--- ok ---
It may also be interesting to replace FSCALE with non-FPU instructions, because it does nothing more than st(0)*2^int(st(1)), where FSCALE truncates st(1) toward zero. This could be replaced by code that directly manipulates the exponent field (of the value 1.0) to get 2^int(st(1)). We even already have the rounded value of st(1) as an integer...
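To illustrate the idea of building 2^n by writing the exponent field of 1.0 directly, here is a minimal sketch in Python rather than MASM. It works on 64-bit doubles (bias 1023) instead of the FPU's 80-bit REAL10 format (bias 16383 = 3FFFh), and the function name pow2_int is made up for this example. Denormals are simply flushed to zero.

```python
import struct

def pow2_int(n: int) -> float:
    """Return 2.0**n for an integer n by writing the biased exponent field directly.

    Hypothetical helper for illustration: uses IEEE-754 doubles (bias 1023),
    not the 80-bit REAL10 format (bias 3FFFh) used by the x87 FPU.
    """
    biased = n + 1023                 # biased exponent of the result
    if biased <= 0:
        return 0.0                    # underflow: denormals are ignored in this sketch
    if biased >= 0x7FF:
        return float("inf")           # overflow: exponent field would be all ones
    bits = biased << 52               # sign = 0, mantissa = 0, i.e. exactly 1.0 * 2**n
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

if __name__ == "__main__":
    for n in (-3, 0, 10):
        print(n, pow2_int(n))
```

No floating-point arithmetic is involved in producing the power of two; only an integer add, a range check and a shift.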
i was playing with a little code to do that
and, while it may be simple to manipulate directly in 99.99 % of the cases, :P
there are those special cases where you need a bunch of if/else statements to handle properly
Quote from: dedndave on April 11, 2013, 05:46:16 AM
i was playing with a little code to do that
and, while it may be simple to manipulate directly in 99.99 % of the cases, :P
there are those special cases where you need a bunch of if/else statements to handle properly
the following code should do it in all cases of valid input:
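; calc: 2^x = 2^(a+b) = 2^a*2^b = 2^fract_part(st0)*2^int_part(st0)
; In:  st0 == exponent x
; Out: st0 == 2^x
pow2 proc
    LOCAL r10[3]:DWORD          ; 12-byte buffer; the low 10 bytes hold a REAL10
    LOCAL exp:SDWORD            ; integer part of x
    mov eax,3fffh               ; REAL10 exponent bias
    fist exp                    ; exp = x rounded to integer (current rounding mode)
    fisub exp                   ; st0 = fractional part of x
    add eax,exp                 ; biased exponent; case: add 3fffh,-X ==> sub 3fffh,X
    jle @err1                   ; underflow
    cmp eax,8000h
    jae @err2                   ; overflow
    mov r10[0],0                ; mantissa, low dword
    mov r10[4],80000000h        ; mantissa, high dword: integer bit set = 1.0
    mov r10[8],eax              ; sign = 0, biased exponent
    f2xm1                       ; st0 = 2^fract - 1
    fadd FP4(1.0)               ; st0 = 2^fract
    fld REAL10 ptr r10          ; st0 = 2^int, st1 = 2^fract
    fmulp st(1),st              ; st0 = 2^x
    ret
@err1:
    fstp st(0)
    fld FP4(0.0)                ; underflow: return 0
    ret
@err2:
    fstp st(0)
    fld FP4(07F800000r)         ; overflow: return +infinity
    ret
pow2 endp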
EDIT: overflow detection was incomplete
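For readers who want to follow the steps without an assembler, the same split 2^x = 2^fract * 2^int can be sketched in Python. This is only an approximation of the routine above, under stated assumptions: it uses doubles (bias 1023) instead of REAL10, math.expm1 stands in for f2xm1, math.ldexp stands in for the multiplication by the hand-built power of two, and the range checks are simplified. The name pow2 is reused purely for illustration.

```python
import math

def pow2(x: float) -> float:
    """Sketch of 2**x via 2**fract * 2**int, for finite x only."""
    n = round(x)                      # nearest integer, ties to even (like fist by default)
    biased = n + 1023                 # double-precision bias; the REAL10 code uses 3FFFh
    if biased <= 0:
        return 0.0                    # underflow: denormals are ignored
    if biased >= 0x7FF:
        return float("inf")           # overflow
    f = x - n                         # fractional part, |f| <= 0.5 (like fisub)
    frac_pow = math.expm1(f * math.log(2.0)) + 1.0   # 2**f; stands in for f2xm1 + fadd 1.0
    return math.ldexp(frac_pow, n)    # scale by 2**n; stands in for fld/fmulp

if __name__ == "__main__":
    for x in (3.0, 0.5, -1.25):
        print(x, pow2(x), 2.0 ** x)
```

Because the integer part is rounded to nearest, the fractional part stays within [-0.5, 0.5], which is inside the input range that f2xm1 accepts.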
Looks very competitive, qWord - compliments :t
Would you mind adding it to MasmBasic, with proper acknowledgement, of course?
EDIT: Shaved off a cycle and a few bytes. See Reply#16 for attachment.
Quote from: jj2007 on April 11, 2013, 07:48:51 AM
Looks very competitive, qWord - compliments :t
Would you mind adding it to MasmBasic, with proper acknowledgement, of course?
I'm afraid Agner Fog had found that long before me ;-D
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
loop overhead is approx. 64/20 cycles
1194 cycles for 20 * Pow2 fist/fild
1118 cycles for 20 * Pow2 fadd One
1189 cycles for 20 * Pow2 frndint
2202 cycles for 20 * PowX fist/fild
754 cycles for 20 * pow qWord
885 cycles for 20 * Pow2 fist/fild
1011 cycles for 20 * Pow2 fadd One
1240 cycles for 20 * Pow2 frndint
2100 cycles for 20 * PowX fist/fild
752 cycles for 20 * pow qWord
1085 cycles for 20 * Pow2 fist/fild
1118 cycles for 20 * Pow2 fadd One
1354 cycles for 20 * Pow2 frndint
2147 cycles for 20 * PowX fist/fild
925 cycles for 20 * pow qWord
877 cycles for 20 * Pow2 fist/fild
894 cycles for 20 * Pow2 fadd One
1208 cycles for 20 * Pow2 frndint
2103 cycles for 20 * PowX fist/fild
1006 cycles for 20 * pow qWord
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
36 bytes for Pow2 frndint
50 bytes for PowX fist/fild
96 bytes for pow qWord
--- ok ---
Quote from: qWord on April 11, 2013, 08:08:48 AM
I'm afraid Agner Fog had found that long before me ;-D
Just found bitRAKE's version (http://www.asmcommunity.net/board/index.php?PHPSESSID=cf1081402952c2c1ed7f56c24f03a5ba&topic=2979.msg30586#msg30586) - he quotes Agner.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 38/20 cycles
2647 cycles for 20 * Pow2 fist/fild
1399 cycles for 20 * pow qWord/Agner
2654 cycles for 20 * Pow2 fist/fild
1399 cycles for 20 * pow qWord/Agner
2645 cycles for 20 * Pow2 fist/fild
1406 cycles for 20 * pow qWord/Agner
44 bytes for Pow2 fist/fild
90 bytes for pow qWord/Agner
Hi, Jochen :t
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 68/20 cycles
6315 cycles for 20 * Pow2 fist/fild
4175 cycles for 20 * pow qWord/Agner
6285 cycles for 20 * Pow2 fist/fild
3900 cycles for 20 * pow qWord/Agner
6479 cycles for 20 * Pow2 fist/fild
4086 cycles for 20 * pow qWord/Agner
44 bytes for Pow2 fist/fild
90 bytes for pow qWord/Agner
--- ok ---
Thanks, Alex :icon14:
Not much difference for your Celeron, it seems ;-)
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 71/20 cycles
837 cycles for 20 * Pow2 fist/fild
1054 cycles for 20 * pow qWord/Agner
Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)
loop overhead is approx. 55/20 cycles
1227 cycles for 20 * Pow2 fist/fild
1065 cycles for 20 * pow qWord/Agner
Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz (SSE4)
loop overhead is approx. 37/20 cycles
2394 cycles for 20 * Pow2 fist/fild
1249 cycles for 20 * pow qWord/Agner
Intel(R) Celeron(R) CPU 2.80GHz (SSE3)
loop overhead is approx. 53/20 cycles
5799 cycles for 20 * Pow2 fist/fild
3774 cycles for 20 * pow qWord/Agner
5806 cycles for 20 * Pow2 fist/fild
3794 cycles for 20 * pow qWord/Agner
5801 cycles for 20 * Pow2 fist/fild
3772 cycles for 20 * pow qWord/Agner
44 bytes for Pow2 fist/fild
90 bytes for pow qWord/Agner
--- ok ---
Thanks, John and Yves. Apparently the algo doesn't like AMD...
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 78/20 cycles
913 cycles for 20 * Pow2 fist/fild
1161 cycles for 20 * pow qWord/Agner
Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
loop overhead is approx. 69/20 cycles
1090 cycles for 20 * Pow2 fist/fild
813 cycles for 20 * pow qWord/Agner
>Apparently the algo doesn't like AMD...
I sometimes feel left out amid all these Intel CPUs.
Plenty of timing code seems to behave the opposite way on my AMD, probably because everyone else is on Intel.
Are there instructions that don't work on AMDs, or that work in a non-standard way?
I used to have a K-6 myself.
Andy
Jochen, what if you add invoke SetProcessAffinityMask,1 to the init of the program? Maybe on the mostly multicore AMD CPUs the thread just gets switched from one core to another?
Quote from: Antariy on April 11, 2013, 11:55:28 PM
Jochen, what if you add invoke SetProcessAffinityMask,1 to the init of the program? Maybe on the mostly multicore AMD CPUs the thread just gets switched from one core to another?
Alex,
I have been scratching my head all the time why the timings were so volatile on some CPUs... thanks for reminding me of SetProcessAffinityMask :t
I was deeply convinced that I had set it somewhere but nope, it just wasn't included :redface:
The good news is it's now included, see attachment.
The bad news is it doesn't make AMD any faster :icon_mrgreen:
Quote
thanks for reminding me of SetProcessAffinityMask
:icon_eek:
from the guy who has probably written more timing tests than anyone else - lol
they could probably run a little longer, too
1) select a single core
2) use Sleep,500 after that (or more) to bind and settle
3) try to make each test use about 0.5 seconds
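The three steps above can be sketched roughly as follows. This is a Python illustration, not the testbed used in this thread: it measures wall-clock seconds per call rather than CPU cycles, the function names pin_to_first_core and seconds_per_call are invented for the example, and the pinning uses os.sched_setaffinity, which exists only on some platforms such as Linux (on Windows the equivalent would be SetProcessAffinityMask, which is not called here).

```python
import os
import time

def pin_to_first_core() -> bool:
    """Try to restrict this process to a single core; return True on success."""
    if hasattr(os, "sched_setaffinity"):          # not available on every platform
        try:
            cpu = min(os.sched_getaffinity(0))    # pick one core we are allowed to use
            os.sched_setaffinity(0, {cpu})
            return True
        except OSError:
            return False
    return False

def seconds_per_call(func, budget_s=0.5, settle_s=0.5, batch=1000):
    """Average wall-clock seconds per call of func, measured for about budget_s."""
    pin_to_first_core()                           # 1) select a single core
    time.sleep(settle_s)                          # 2) give the scheduler time to settle
    calls = 0
    start = time.perf_counter()
    while True:                                   # 3) keep testing until the budget is used
        for _ in range(batch):
            func()
        calls += batch
        elapsed = time.perf_counter() - start
        if elapsed >= budget_s:
            return elapsed / calls

if __name__ == "__main__":
    print(seconds_per_call(lambda: 2.0 ** 0.5))
```

Running each test for a fixed time budget instead of a fixed iteration count keeps slow and fast machines comparable in how long they spend per test.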
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 58/20 cycles
5899 cycles for 20 * Pow2 fist/fild
4082 cycles for 20 * pow qWord/Agner
5930 cycles for 20 * Pow2 fist/fild
4049 cycles for 20 * pow qWord/Agner
5925 cycles for 20 * Pow2 fist/fild
4096 cycles for 20 * pow qWord/Agner
Jochen,
results from Pow2Timings3.zip:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 76/20 cycles
1029 cycles for 20 * Pow2 fist/fild
1503 cycles for 20 * pow qWord/Agner
1027 cycles for 20 * Pow2 fist/fild
1726 cycles for 20 * pow qWord/Agner
1639 cycles for 20 * Pow2 fist/fild
899 cycles for 20 * pow qWord/Agner
44 bytes for Pow2 fist/fild
90 bytes for pow qWord/Agner
--- ok ---
Gunther
Hi Jochen!
Quote from: jj2007 on April 12, 2013, 12:58:59 AM
I have been scratching my head all the time why the timings were so volatile on some CPUs... thanks for reminding me of SetProcessAffinityMask :t
I was deeply convinced that I had set it somewhere but nope, it just wasn't included :redface:
The good news is it's now included, see attachment.
The bad news is it doesn't make AMD any faster :icon_mrgreen:
:redface:
At least the timings will no longer jump like crazy rabbits in other tests, multicore-affected or not, where the testbed could be used - you made it a very comprehensive template :t
Quote from: Antariy on April 12, 2013, 02:12:06 PM
At least the timings will no longer jump like crazy rabbits in other tests, multicore-affected or not, where the testbed could be used - you made it a very comprehensive template :t
Thanks, Alex.
Exp10, Exp2, ExpE and ExpXY are now implemented, see here (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1192).
Jochen,
Quote from: jj2007 on April 15, 2013, 08:48:17 AM
Exp10, Exp2, ExpE and ExpXY are now implemented, see here (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1192).
rock solid work. I've checked it. :t
Gunther