The MASM Forum
General => The Laboratory => Topic started by: jj2007 on April 10, 2013, 10:30:23 PM
-
Hi folks,
Could I please have some timings for these Y=2^x algos?
Thanks, JJ
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 79/20 cycles
913 cycles for 20 * Pow2 fist/fild
907 cycles for 20 * Pow2 fadd One
1500 cycles for 20 * Pow2 frndint
-
Updated:
Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)
loop overhead is approx. 30/20 cycles
1037 cycles for 20 * Pow2 fist/fild
1039 cycles for 20 * Pow2 fadd One
1183 cycles for 20 * Pow2 frndint
2301 cycles for 20 * PowX fist/fild
2429 cycles for 20 * PowX frndint
1071 cycles for 20 * Pow2 fist/fild
1046 cycles for 20 * Pow2 fadd One
1153 cycles for 20 * Pow2 frndint
2544 cycles for 20 * PowX fist/fild
2740 cycles for 20 * PowX frndint
1343 cycles for 20 * Pow2 fist/fild
1202 cycles for 20 * Pow2 fadd One
1448 cycles for 20 * Pow2 frndint
2600 cycles for 20 * PowX fist/fild
2438 cycles for 20 * PowX frndint
1295 cycles for 20 * Pow2 fist/fild
1322 cycles for 20 * Pow2 fadd One
1452 cycles for 20 * Pow2 frndint
2775 cycles for 20 * PowX fist/fild
2624 cycles for 20 * PowX frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
36 bytes for Pow2 frndint
50 bytes for PowX fist/fild
47 bytes for PowX frndint
--- ok ---
-
Jochen,
here are my results:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 77/20 cycles
1036 cycles for 20 * Pow2 fist/fild
1053 cycles for 20 * Pow2 fadd One
24582 cycles for 20 * Pow2 frndint
1027 cycles for 20 * Pow2 fist/fild
1054 cycles for 20 * Pow2 fadd One
24499 cycles for 20 * Pow2 frndint
1026 cycles for 20 * Pow2 fist/fild
1351 cycles for 20 * Pow2 fadd One
24550 cycles for 20 * Pow2 frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
34 bytes for Pow2 frndint
--- ok ---
By the way: well done. :t
Gunther
-
Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz (SSE4)
loop overhead is approx. 68/20 cycles
2296 cycles for 20 * Pow2 fist/fild
2324 cycles for 20 * Pow2 fadd One
14751 cycles for 20 * Pow2 frndint
2298 cycles for 20 * Pow2 fist/fild
2300 cycles for 20 * Pow2 fadd One
14857 cycles for 20 * Pow2 frndint
2325 cycles for 20 * Pow2 fist/fild
2295 cycles for 20 * Pow2 fadd One
14731 cycles for 20 * Pow2 frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
34 bytes for Pow2 frndint
-
Hi,
P-III, others if wanted.
pre-P4 (SSE1)
loop overhead is approx. 48/20 cycles
2665 cycles for 20 * Pow2 fist/fild
2654 cycles for 20 * Pow2 fadd One
8635 cycles for 20 * Pow2 frndint
2664 cycles for 20 * Pow2 fist/fild
2655 cycles for 20 * Pow2 fadd One
8635 cycles for 20 * Pow2 frndint
2664 cycles for 20 * Pow2 fist/fild
2650 cycles for 20 * Pow2 fadd One
8635 cycles for 20 * Pow2 frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
34 bytes for Pow2 frndint
--- ok ---
Regards,
Steve N.
-
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 51/20 cycles
5718 cycles for 20 * Pow2 fist/fild
5759 cycles for 20 * Pow2 fadd One
78540 cycles for 20 * Pow2 frndint
5740 cycles for 20 * Pow2 fist/fild
5725 cycles for 20 * Pow2 fadd One
78548 cycles for 20 * Pow2 frndint
5732 cycles for 20 * Pow2 fist/fild
5716 cycles for 20 * Pow2 fadd One
79000 cycles for 20 * Pow2 frndint
i don't know what's in frndint, but it doesn't like P4's - lol
-
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 75/20 cycles
895 cycles for 20 * Pow2 fist/fild
890 cycles for 20 * Pow2 fadd One
1465 cycles for 20 * Pow2 frndint
Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)
loop overhead is approx. 39/20 cycles
1254 cycles for 20 * Pow2 fist/fild
1279 cycles for 20 * Pow2 fadd One
28154 cycles for 20 * Pow2 frndint
-
>i don't know what's in frndint, but it doesn't like P4's - lol
Could be related to the fact that I forgot one fld st in that algo :redface:
New version attached on top of this thread. My apologies for having wasted your time. To compensate, I added z=x^y to the list - see PowX.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
912 cycles for 20 * Pow2 fist/fild
980 cycles for 20 * Pow2 fadd One
1041 cycles for 20 * Pow2 frndint
4059 cycles for 20 * PowX fist/fild
3982 cycles for 20 * PowX frndint
841 cycles for 20 * Pow2 fist/fild
906 cycles for 20 * Pow2 fadd One
1039 cycles for 20 * Pow2 frndint
4054 cycles for 20 * PowX fist/fild
4065 cycles for 20 * PowX frndint
-
much better :biggrin:
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 47/20 cycles
5724 cycles for 20 * Pow2 fist/fild
5745 cycles for 20 * Pow2 fadd One
5932 cycles for 20 * Pow2 frndint
8048 cycles for 20 * PowX fist/fild
8082 cycles for 20 * PowX frndint
5739 cycles for 20 * Pow2 fist/fild
5715 cycles for 20 * Pow2 fadd One
5947 cycles for 20 * Pow2 frndint
8371 cycles for 20 * PowX fist/fild
8105 cycles for 20 * PowX frndint
5737 cycles for 20 * Pow2 fist/fild
5759 cycles for 20 * Pow2 fadd One
5924 cycles for 20 * Pow2 frndint
8060 cycles for 20 * PowX fist/fild
8094 cycles for 20 * PowX frndint
5748 cycles for 20 * Pow2 fist/fild
5736 cycles for 20 * Pow2 fadd One
5930 cycles for 20 * Pow2 frndint
8021 cycles for 20 * PowX fist/fild
8116 cycles for 20 * PowX frndint
-
Jochen,
never mind; no need for excuses. Here are the new timings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 78/20 cycles
1481 cycles for 20 * Pow2 fist/fild
1505 cycles for 20 * Pow2 fadd One
1479 cycles for 20 * Pow2 frndint
2755 cycles for 20 * PowX fist/fild
2820 cycles for 20 * PowX frndint
1325 cycles for 20 * Pow2 fist/fild
1356 cycles for 20 * Pow2 fadd One
1636 cycles for 20 * Pow2 frndint
2774 cycles for 20 * PowX fist/fild
2810 cycles for 20 * PowX frndint
1325 cycles for 20 * Pow2 fist/fild
1352 cycles for 20 * Pow2 fadd One
1644 cycles for 20 * Pow2 frndint
2595 cycles for 20 * PowX fist/fild
2794 cycles for 20 * PowX frndint
1494 cycles for 20 * Pow2 fist/fild
1509 cycles for 20 * Pow2 fadd One
1180 cycles for 20 * Pow2 frndint
2591 cycles for 20 * PowX fist/fild
2803 cycles for 20 * PowX frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
36 bytes for Pow2 frndint
50 bytes for PowX fist/fild
47 bytes for PowX frndint
--- ok ---
Gunther
-
P-III
pre-P4 (SSE1)
loop overhead is approx. 48/20 cycles
2665 cycles for 20 * Pow2 fist/fild
2652 cycles for 20 * Pow2 fadd One
2717 cycles for 20 * Pow2 frndint
4829 cycles for 20 * PowX fist/fild
5023 cycles for 20 * PowX frndint
2664 cycles for 20 * Pow2 fist/fild
2658 cycles for 20 * Pow2 fadd One
2711 cycles for 20 * Pow2 frndint
4833 cycles for 20 * PowX fist/fild
5017 cycles for 20 * PowX frndint
2670 cycles for 20 * Pow2 fist/fild
2652 cycles for 20 * Pow2 fadd One
2712 cycles for 20 * Pow2 frndint
4835 cycles for 20 * PowX fist/fild
5028 cycles for 20 * PowX frndint
2664 cycles for 20 * Pow2 fist/fild
2653 cycles for 20 * Pow2 fadd One
2711 cycles for 20 * Pow2 frndint
4837 cycles for 20 * PowX fist/fild
5026 cycles for 20 * PowX frndint
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
36 bytes for Pow2 frndint
50 bytes for PowX fist/fild
47 bytes for PowX frndint
--- ok ---
-
It may also be interesting to replace FSCALE with non-FPU instructions, because it does nothing more than st(0)*2^rndint(st(1)). It could be replaced by code that directly manipulates the exponent field (of the value 1.0) to get 2^rndint(st(1)). We even already have the rounded value of st(1) as an integer...
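A minimal sketch of that idea, in C rather than MASM so it compiles anywhere, and using an IEEE-754 double (11-bit exponent, bias 1023) instead of the FPU's REAL10 format (15-bit exponent, bias 3FFFh). The function names are made up for illustration, and only integer exponents that give a normal result are handled:

```c
#include <stdint.h>
#include <string.h>

/* Return 2^n by writing n + bias into the exponent field of a double.
   The mantissa bits stay zero, so the value is exactly 1.0 * 2^n.
   Assumption: -1022 <= n <= 1023 (normal doubles only, no range check). */
double pow2_from_exponent(int n)
{
    uint64_t bits = (uint64_t)(n + 1023) << 52;   /* sign 0, mantissa 0 */
    double result;
    memcpy(&result, &bits, sizeof result);        /* reinterpret the bit pattern */
    return result;
}

/* value * 2^n for an integer n, built from the constructed power of two
   instead of a scaling instruction. */
double scale_by_pow2(double value, int n)
{
    return value * pow2_from_exponent(n);
}
```

For example, scale_by_pow2(1.5, 4) multiplies 1.5 by a constructed 16.0. Out-of-range exponents would need the kind of special-case handling discussed in the following posts.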
-
i was playing with a little code to do that
and, while it may be simple to manipulate directly in 99.99 % of the cases, :P
there are those special cases where you need a bunch of if/else statements to handle properly
-
>i was playing with a little code to do that
>and, while it may be simple to manipulate directly in 99.99 % of the cases, :P
>there are those special cases where you need a bunch of if/else statements to handle properly
the following code should do it in all cases of valid input:
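; calc: 2^x = 2^(a+b) = 2^a*2^b = 2^fract_part(st0)*2^int_part(st0)
; In: st0 == exponent
; Out: st0 == 2^st0
pow2 proc
    LOCAL r10[3]:DWORD      ; 12-byte buffer that receives 2^int_part as a REAL10
    LOCAL exp:SDWORD        ; int_part(st0)
    mov eax,3fffh           ; REAL10 exponent bias
    fist exp                ; exp = st0 rounded to an integer
    fisub exp               ; st0 = fract_part (|st0| <= 0.5 in default round-to-nearest mode)
    add eax,exp             ; case: add 3fffh,-X ==> sub 3fffh,X
    jle @err1               ; underflow
    cmp eax,8000h
    jae @err2               ; overflow
    mov r10[0],0            ; mantissa, low dword
    mov r10[4],80000000h    ; mantissa, high dword: integer bit set = 1.0
    mov r10[8],eax          ; sign = 0, biased exponent
    f2xm1                   ; st0 = 2^fract_part - 1
    fadd FP4(1.0)           ; st0 = 2^fract_part
    fld REAL10 ptr r10      ; st0 = 2^int_part
    fmulp st(1),st          ; st0 = 2^fract_part * 2^int_part
    ret
@err1:
    fstp st(0)
    fld FP4(0.0)            ; underflow: return 0
    ret
@err2:
    fstp st(0)
    fld FP4(07F800000r)     ; overflow: return +infinity
    ret
pow2 endp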
EDIT: overflow detection was incomplete
-
Looks very competitive, qWord - compliments :t
Would you mind adding it to MasmBasic, with proper acknowledgement, of course?
EDIT: Shaved off a cycle and a few bytes. See Reply#16 for attachment.
-
>Looks very competitive, qWord - compliments :t
>Would you mind adding it to MasmBasic, with proper acknowledgement, of course?
I'm afraid Agner Fog had found that long before me ;-D
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
loop overhead is approx. 64/20 cycles
1194 cycles for 20 * Pow2 fist/fild
1118 cycles for 20 * Pow2 fadd One
1189 cycles for 20 * Pow2 frndint
2202 cycles for 20 * PowX fist/fild
754 cycles for 20 * pow qWord
885 cycles for 20 * Pow2 fist/fild
1011 cycles for 20 * Pow2 fadd One
1240 cycles for 20 * Pow2 frndint
2100 cycles for 20 * PowX fist/fild
752 cycles for 20 * pow qWord
1085 cycles for 20 * Pow2 fist/fild
1118 cycles for 20 * Pow2 fadd One
1354 cycles for 20 * Pow2 frndint
2147 cycles for 20 * PowX fist/fild
925 cycles for 20 * pow qWord
877 cycles for 20 * Pow2 fist/fild
894 cycles for 20 * Pow2 fadd One
1208 cycles for 20 * Pow2 frndint
2103 cycles for 20 * PowX fist/fild
1006 cycles for 20 * pow qWord
44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
36 bytes for Pow2 frndint
50 bytes for PowX fist/fild
96 bytes for pow qWord
--- ok ---
-
>I'm afraid Agner Fog had found that long before me ;-D
Just found bitRAKE's version (http://www.asmcommunity.net/board/index.php?PHPSESSID=cf1081402952c2c1ed7f56c24f03a5ba&topic=2979.msg30586#msg30586) - he quotes Agner.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 38/20 cycles
2647 cycles for 20 * Pow2 fist/fild
1399 cycles for 20 * pow qWord/Agner
2654 cycles for 20 * Pow2 fist/fild
1399 cycles for 20 * pow qWord/Agner
2645 cycles for 20 * Pow2 fist/fild
1406 cycles for 20 * pow qWord/Agner
44 bytes for Pow2 fist/fild
90 bytes for pow qWord/Agner
-
Hi, Jochen :t
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 68/20 cycles
6315 cycles for 20 * Pow2 fist/fild
4175 cycles for 20 * pow qWord/Agner
6285 cycles for 20 * Pow2 fist/fild
3900 cycles for 20 * pow qWord/Agner
6479 cycles for 20 * Pow2 fist/fild
4086 cycles for 20 * pow qWord/Agner
44 bytes for Pow2 fist/fild
90 bytes for pow qWord/Agner
--- ok ---
-
Thanks, Alex :icon14:
Not much difference for your Celeron, it seems ;-)
-
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 71/20 cycles
837 cycles for 20 * Pow2 fist/fild
1054 cycles for 20 * pow qWord/Agner
Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)
loop overhead is approx. 55/20 cycles
1227 cycles for 20 * Pow2 fist/fild
1065 cycles for 20 * pow qWord/Agner
Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz (SSE4)
loop overhead is approx. 37/20 cycles
2394 cycles for 20 * Pow2 fist/fild
1249 cycles for 20 * pow qWord/Agner
-
Intel(R) Celeron(R) CPU 2.80GHz (SSE3)
loop overhead is approx. 53/20 cycles
5799 cycles for 20 * Pow2 fist/fild
3774 cycles for 20 * pow qWord/Agner
5806 cycles for 20 * Pow2 fist/fild
3794 cycles for 20 * pow qWord/Agner
5801 cycles for 20 * Pow2 fist/fild
3772 cycles for 20 * pow qWord/Agner
44 bytes for Pow2 fist/fild
90 bytes for pow qWord/Agner
--- ok ---
-
Thanks, John and Yves. Apparently the algo doesn't like AMD...
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 78/20 cycles
913 cycles for 20 * Pow2 fist/fild
1161 cycles for 20 * pow qWord/Agner
Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
loop overhead is approx. 69/20 cycles
1090 cycles for 20 * Pow2 fist/fild
813 cycles for 20 * pow qWord/Agner
-
>Apparently the algo doesn't like AMD...
I sometimes feel left out amid all these Intel CPUs.
Plenty of timing code seems to behave the opposite way on my AMD, probably because everyone else is on Intel.
-
Are there instructions that don't work on AMDs, or that work in a non-standard way?
I used to have a K-6 myself.
Andy
-
Jochen, what if you add invoke SetProcessAffinityMask,1 to the init of the program? Maybe on the mostly multicore AMD CPUs the thread just gets switched from one core to another?
-
>Jochen, what if you add invoke SetProcessAffinityMask,1 to the init of the program? Maybe on the mostly multicore AMD CPUs the thread just gets switched from one core to another?
Alex,
I have been scratching my head all the time why the timings were so volatile on some CPUs... thanks for reminding me of SetProcessAffinityMask :t
I was deeply convinced that I had set it somewhere but nope, it just wasn't included :redface:
The good news is it's now included, see attachment.
The bad news is it doesn't make AMD any faster :icon_mrgreen:
-
>thanks for reminding me of SetProcessAffinityMask
:icon_eek:
from the guy who has probably written more timing tests than anyone else - lol
they could probably run a little longer, too
1) select a single core
2) use Sleep,500 after that (or more) to bind and settle
3) try to make each test use about 0.5 seconds
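The three steps could look roughly like this in C. SetProcessAffinityMask, GetCurrentProcess and Sleep are the real Win32 calls; everything else (the function names, the clock()-based timing, the batch size of 1000) is made up for illustration, and the sketch measures seconds per call rather than the cycle counts reported in this thread:

```c
#include <time.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Steps 1 and 2: bind the process to the first core, then let it settle.
   Returns 1 if the affinity was set, 0 otherwise (always 0 off Windows). */
int pin_to_first_core(void)
{
#ifdef _WIN32
    if (!SetProcessAffinityMask(GetCurrentProcess(), 1))
        return 0;
    Sleep(500);
    return 1;
#else
    return 0;   /* not implemented in this sketch */
#endif
}

/* Step 3: keep calling fn until roughly `seconds` have elapsed,
   then return the average time per call in seconds. */
double seconds_per_call(void (*fn)(void), double seconds)
{
    const clock_t budget = (clock_t)(seconds * CLOCKS_PER_SEC);
    const clock_t start  = clock();
    clock_t now;
    unsigned long calls = 0;

    do {
        for (int i = 0; i < 1000; ++i)   /* batch, so clock() is not called too often */
            fn();
        calls += 1000;
        now = clock();
    } while (now - start < budget);

    return (double)(now - start) / CLOCKS_PER_SEC / (double)calls;
}

/* Dummy workload so the sketch can be tried out. */
volatile double sink;
void sample_work(void)
{
    sink = sink * 0.5 + 1.0;
}
```

A test run would call pin_to_first_core() once at startup and then something like seconds_per_call(sample_work, 0.5) for each algo.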
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 58/20 cycles
5899 cycles for 20 * Pow2 fist/fild
4082 cycles for 20 * pow qWord/Agner
5930 cycles for 20 * Pow2 fist/fild
4049 cycles for 20 * pow qWord/Agner
5925 cycles for 20 * Pow2 fist/fild
4096 cycles for 20 * pow qWord/Agner
-
Jochen,
results from Pow2Timings3.zip:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 76/20 cycles
1029 cycles for 20 * Pow2 fist/fild
1503 cycles for 20 * pow qWord/Agner
1027 cycles for 20 * Pow2 fist/fild
1726 cycles for 20 * pow qWord/Agner
1639 cycles for 20 * Pow2 fist/fild
899 cycles for 20 * pow qWord/Agner
44 bytes for Pow2 fist/fild
90 bytes for pow qWord/Agner
--- ok ---
Gunther
-
Hi Jochen!
>I have been scratching my head all the time why the timings were so volatile on some CPUs... thanks for reminding me of SetProcessAffinityMask :t
>I was deeply convinced that I had set it somewhere but nope, it just wasn't included :redface:
>The good news is it's now included, see attachment.
>The bad news is it doesn't make AMD any faster :icon_mrgreen:
:redface:
At least the timings will no longer jump like crazy rabbits in other tests, multicore-affected or not, where the testbed could be used - you made it a very comprehensive template :t
-
>At least the timings will no longer jump like crazy rabbits in other tests, multicore-affected or not, where the testbed could be used - you made it a very comprehensive template :t
Thanks, Alex.
Exp10, Exp2, ExpE and ExpXY are now implemented, see here (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1192).
-
Jochen,
>Exp10, Exp2, ExpE and ExpXY are now implemented, see here (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1192).
rock solid work. I've checked it. :t
Gunther