I am trying to find an FPU power function (x^y)
I believe FYL2X, FYL2XP1 and F2XM1 could help you out. http://www.ray.masmcode.com/tutorial/appen1.htm
That is because x^y = 2^(y*log2(x)). I am not sure whether there is a better solution.
this is Ray's code, put into PROC form for REAL10's
f10Exp PROTO :LPVOID,:LPVOID,:LPVOID
;***********************************************************************************************
OPTION PROLOGUE:None
OPTION EPILOGUE:None
f10Exp PROC lpResult:LPVOID,lpBase:LPVOID,lpExponent:LPVOID
;0.0 raised to any exponent returns an indefinite qNaN
;any non-zero base raised to 0.0 returns 1.0
mov ecx,[esp+12] ;ECX = lpExponent
mov edx,[esp+8] ;EDX = lpBase
mov eax,[esp+4] ;EAX = lpResult
fld real10 ptr [ecx]
fld real10 ptr [edx]
fyl2x
fld st
frndint
fsub st(1),st
fxch st(1)
f2xm1
fld1
fadd
fscale
fstp st(1)
fstp real10 ptr [eax]
ret 12
f10Exp ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;***********************************************************************************************
Cool ... Ray's original is in \masm32\fpulib\FpuXexpY.asm, with other good stuff. Your version is a useful modification, easier to understand without his extensive input checking
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
Init
PrintLine "exact:", Tb$, "1.41421356237309504880..." ; Wolfram Alpha (http://www.wolframalpha.com/input/?i=2%5E0.5)
Inkey Str$("2^0.5=\t%Jf", ExpXY(2, 0.5)) ; more (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1192)
EndOfCode
Output:
exact: 1.41421356237309504880...
2^0.5= 1.414213562373095049
@jj2007,
Should have known MasmBasic has an exp function! All it needs now is a "kitch*n sink" function :P
P.S. If you spell the word "kitch*n" with the e, this forum software substitutes "cuisine". I, and others, have noted this strange interpolation before
Did you know that the famous cuisine problem (http://masm32.com/board/index.php?topic=4566.msg49025#msg49025) is a feature, not a bug?
:P
Another one,
fld FLT4(0.5)
fld FLT4(2.0)
fyl2x
fld1
fld st(1)
fprem
f2xm1
faddp
fscale
fstp result ; result = 1.414213562373
fstp st(0)
Quote from: jj2007: Did you know that the famous cuisine problem is a feature, not a bug?
- No I didn't - and still don't :eusa_snooty: Some bored web-forum-software programmer's idea of a joke, apparently :biggrin: But it's nice to get it straight finally; hope no other words are affected
@siekmanski,
*** Nit Pick Alert ***
fprem is slow; perhaps that's why Raymond uses frndint and fsub instead. On the other hand, probably not, since speed is not a major concern for him in this library (as he states somewhere). fscale is also slow (per Agner Fog); he recommends shifting the float's exponent bits directly, but in this case the algo would have to be changed, since that must be done to a float in memory. Possibly it's best to multiply by the appropriate power of 2; but actually the best course is to ignore this nit entirely - unless speed is very important
Yes, on most CPUs frndint and fsub is faster (just checked Agner Fog). On AMD, fprem is a little bit faster than frndint.
Thanks, that's useful to know; too bad AMD is so different speed-wise. Another important example is fsin. Fast routines really should have different versions for Intel and AMD. Almost makes you want to use a (good) optimizing C++ compiler
Yeah, it's very difficult to write one "fastest routine" for all cpu's.
Man ... :icon_eek: All these instructions! I thought it was just a single instruction like "EXP"; I'm looking for a shorthand
the "shorthand" you are looking for is in reply #2 :biggrin:
that is, put it in a PROC, then use it to do the work
.DATA
r10Base REAL10 2.0
r10Exp REAL10 3.0
.DATA?
r10Result REAL10 ?
.CODE
INVOKE f10Exp,offset r10Result,offset r10Base,offset r10Exp
when you're done, r10Result will be 8.0
Ok Dave Thank you, :biggrin:
I've been using REAL8. Should I use the conversion of REAL10 to REAL8? Also, how do I print a REAL8 with printf?
Print REAL8 variables with, for instance, "%.4g"; there are many other format options available as well. Both REAL4 and REAL10 are not so easy. If you don't care about the extra precision of REAL10, simplest would be to stick with REAL8 - just convert all the 10's to 8's in dedndave's code
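In C terms a REAL8 is a double, and the CRT formats it like this (a minimal sketch; formats_as is just an illustrative helper, not a CRT function):

```c
#include <stdio.h>
#include <string.h>

/* Formatting a REAL8 (a C double) with the CRT's printf-style format
   strings, as suggested above. "%.4g" prints 4 significant digits. */
static int formats_as(double v, const char *fmt, const char *expect)
{
    char buf[64];
    snprintf(buf, sizeof buf, fmt, v);
    return strcmp(buf, expect) == 0;
}
/* In a program you would simply write:
     printf("%.4g\n", r8);    four significant digits
     printf("%.15f\n", r8);   fixed-point, 15 decimals
     printf("%e\n", r8);      scientific notation          */
```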
Thank you rrr314159, it works :t
You're welcome; it's a privilege to help such a hard worker as yourself
@Siekmanski et al,
It occurred to me that fyl2x and f2xm1 are similar to fsin: they use a power series to compute the function. And fsin is incredibly slow on Intel - it can be beaten by a factor of 100, as we found out in the Trigonometry thread in the Laboratory. Checked Agner; it turns out these are also very slow (on Prescott particularly, but it probably applies to modern Intels). So it should be easy to go much faster. Essentially that would entail going back to the way we did it in the good old days: roll-your-own power series. A PITA, but worth it if you need speed. If I get around to it I'll start a Laboratory thread on the topic. Of course at my age that might take a year; anyone else is welcome to look into it
the REAL10 proc can be easily modified to REAL8
i use REAL10's because that's the "native" size for the FPU, internally
notice that, when you load REAL8's into the FPU, it converts them to REAL10's internally
hopefully, the result is in range when it saves the result to a REAL8 when done :P
f8Exp PROTO :LPVOID,:LPVOID,:LPVOID
;***********************************************************************************************
OPTION PROLOGUE:None
OPTION EPILOGUE:None
f8Exp PROC lpResult:LPVOID,lpBase:LPVOID,lpExponent:LPVOID
;0.0 raised to any exponent returns an indefinite qNaN
;any non-zero base raised to 0.0 returns 1.0
mov ecx,[esp+12] ;ECX = lpExponent
mov edx,[esp+8] ;EDX = lpBase
mov eax,[esp+4] ;EAX = lpResult
fld real8 ptr [ecx]
fld real8 ptr [edx]
fyl2x
fld st
frndint
fsub st(1),st
fxch st(1)
f2xm1
fld1
fadd
fscale
fstp st(1)
fstp real8 ptr [eax]
ret 12
f8Exp ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;***********************************************************************************************
It would be a nice challenge to come up with a faster power function.
Wish I had more time to give it a try.....
Quote from: Siekmanski on October 06, 2015, 08:28:12 PM
It would be a nice challenge to come up with a faster power function.
Testbed attached :biggrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
24947 cycles for 100 * f8Exp (Dave)
12338 cycles for 100 * ExpXY (MasmBasic)
25097 cycles for 100 * pow (CRT)
24944 cycles for 100 * f8Exp (Dave)
12329 cycles for 100 * ExpXY (MasmBasic)
24983 cycles for 100 * pow (CRT)
24959 cycles for 100 * f8Exp (Dave)
12298 cycles for 100 * ExpXY (MasmBasic)
24983 cycles for 100 * pow (CRT)
24947 cycles for 100 * f8Exp (Dave)
12345 cycles for 100 * ExpXY (MasmBasic)
24990 cycles for 100 * pow (CRT)
24957 cycles for 100 * f8Exp (Dave)
12328 cycles for 100 * ExpXY (MasmBasic)
24972 cycles for 100 * pow (CRT)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)
20886 cycles for 100 * f8Exp (Dave, 2^0.5)
18506 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30764 cycles for 100 * pow (CRT, 4^0.5)
20947 cycles for 100 * f8Exp (Dave, 2^0.5)
18486 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
31182 cycles for 100 * pow (CRT, 4^0.5)
21028 cycles for 100 * f8Exp (Dave, 2^0.5)
18480 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30621 cycles for 100 * pow (CRT, 4^0.5)
21278 cycles for 100 * f8Exp (Dave, 2^0.5)
18486 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30769 cycles for 100 * pow (CRT, 4^0.5)
20891 cycles for 100 * f8Exp (Dave, 2^0.5)
18572 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30608 cycles for 100 * pow (CRT, 4^0.5)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)
25147 cycles for 100 * f8Exp (Dave, 2^0.5)
12421 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25152 cycles for 100 * pow (CRT, 4^0.5)
25197 cycles for 100 * f8Exp (Dave, 2^0.5)
12404 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25155 cycles for 100 * pow (CRT, 4^0.5)
25099 cycles for 100 * f8Exp (Dave, 2^0.5)
12393 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25162 cycles for 100 * pow (CRT, 4^0.5)
25092 cycles for 100 * f8Exp (Dave, 2^0.5)
12388 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25194 cycles for 100 * pow (CRT, 4^0.5)
25115 cycles for 100 * f8Exp (Dave, 2^0.5)
12417 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25116 cycles for 100 * pow (CRT, 4^0.5)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
25032 cycles for 100 * f8Exp (Dave, 2^0.5)
12318 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25215 cycles for 100 * pow (CRT, 4^0.5)
25034 cycles for 100 * f8Exp (Dave, 2^0.5)
12314 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25169 cycles for 100 * pow (CRT, 4^0.5)
25055 cycles for 100 * f8Exp (Dave, 2^0.5)
12332 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25148 cycles for 100 * pow (CRT, 4^0.5)
25066 cycles for 100 * f8Exp (Dave, 2^0.5)
12312 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25172 cycles for 100 * pow (CRT, 4^0.5)
25048 cycles for 100 * f8Exp (Dave, 2^0.5)
12302 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25132 cycles for 100 * pow (CRT, 4^0.5)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
--- ok ---
Thanks jj,
Obvious question: why is ExpXY faster? Aren't you (MasmBasic) using the same instructions - fyl2x and f2xm1, fprem or frndint, and fscale? If instead you're using a traditional algo (pre-fyl2x and f2xm1), that may explain it.
Another obvious one: why different test cases for the three "Units Under Test" - square roots of 2, 3, and 4? Judging by the numbers it's not a problem. For some implementations 4^0.5 might go a lot faster, but obviously not with CRT pow. I guess we can suppose any inputs will go at the same speed.
Evidently ExpXY gets less of an advantage (though still ahead) on AMD, which (judging by fsin) may have much better implementations of fyl2x and f2xm1. That would make sense if you're using a traditional algo which implements its own power series (for ln and exp)
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
18776 cycles for 100 * f8Exp (Dave, 2^0.5)
17133 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22233 cycles for 100 * pow (CRT, 4^0.5)
18549 cycles for 100 * f8Exp (Dave, 2^0.5)
17595 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22059 cycles for 100 * pow (CRT, 4^0.5)
18576 cycles for 100 * f8Exp (Dave, 2^0.5)
17082 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22169 cycles for 100 * pow (CRT, 4^0.5)
18787 cycles for 100 * f8Exp (Dave, 2^0.5)
17142 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22370 cycles for 100 * pow (CRT, 4^0.5)
18554 cycles for 100 * f8Exp (Dave, 2^0.5)
17172 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22242 cycles for 100 * pow (CRT, 4^0.5)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
--- ok ---
Quote from: rrr314159 on October 07, 2015, 12:31:15 AM: I guess we can suppose any inputs will go at the same speed.
That's an interesting hypothesis (and it was mine, too), but unfortunately it is strongly at odds with reality 8)
So ... wouldn't it be better to test them with the same inputs? Not that I want to make extra work for you. Only 4^0.5 ought to go faster, and since it's used with the slowest algo, the results should be good anyway
Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz (SSE4)
28576 cycles for 100 * f8Exp (Dave, 2^0.5)
13867 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30481 cycles for 100 * pow (CRT, 4^0.5)
28425 cycles for 100 * f8Exp (Dave, 2^0.5)
13873 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27419 cycles for 100 * pow (CRT, 4^0.5)
28460 cycles for 100 * f8Exp (Dave, 2^0.5)
13872 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27390 cycles for 100 * pow (CRT, 4^0.5)
28419 cycles for 100 * f8Exp (Dave, 2^0.5)
13830 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27393 cycles for 100 * pow (CRT, 4^0.5)
28441 cycles for 100 * f8Exp (Dave, 2^0.5)
13849 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27396 cycles for 100 * pow (CRT, 4^0.5)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
--- ok ---
Quote from: rrr314159 on October 07, 2015, 12:45:57 AM
So ... wouldn't it be better to test them with the same inputs? Not that I want to make extra work for you. Only 4^0.5 ought to go faster, and since it's used with the slowest algo, the results should be good anyway
Here is one with almost identical values. The little difference is to verify that the algo actually changed the variable...
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
14648 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12378 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24845 cycles for 100 * pow (CRT, 5.4323^0.5)
14645 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12376 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24840 cycles for 100 * pow (CRT, 5.4323^0.5)
14656 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12378 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24854 cycles for 100 * pow (CRT, 5.4323^0.5)
14656 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12373 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24837 cycles for 100 * pow (CRT, 5.4323^0.5)
14642 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12374 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24848 cycles for 100 * pow (CRT, 5.4323^0.5)
5.43210000000000 for f8Exp (Dave, 5.4321^0.5)^2
5.43220000000000 for ExpXY (MasmBasic, 5.4322^0.5)^2
5.43230000000000 for pow (CRT, 5.4323^0.5)^2
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)
21041 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23269 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
32989 cycles for 100 * pow (CRT, 5.4323^0.5)
20927 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23154 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33051 cycles for 100 * pow (CRT, 5.4323^0.5)
20892 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23173 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33151 cycles for 100 * pow (CRT, 5.4323^0.5)
20886 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23342 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33086 cycles for 100 * pow (CRT, 5.4323^0.5)
21158 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23167 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33050 cycles for 100 * pow (CRT, 5.4323^0.5)
5.43210000000000 for f8Exp (Dave, 5.4321^0.5)^2
5.43220000000000 for ExpXY (MasmBasic, 5.4322^0.5)^2
5.43230000000000 for pow (CRT, 5.4323^0.5)^2
--- ok ---
Am I right, that pow(x,y) is the same as exp(y * log(x)) ?
For example: pow(2.0, 0.5) = 1.414213562373 as calculated in the routines from Dave and JJ2007.
But if I calculate exp(0.5 * log(2.0)), I get 1.16243273894 as a result on my calculator.
log(2.0) = 0.30102999566398119521
exp(0.5 * 0.30102999566398119521) = 1.16243273894
Below is a fast exp approximation routine (the log() value is hard-coded; it needs a fast log routine.....).
It's not exact, but it's fast:
.data
exp0 real8 0.99962789571721377560130024774
exp1 real8 0.99793872910703643074502373922
exp2 real8 0.50289865085404914825661658491
exp3 real8 0.17648623219024696305508965346
exp4 real8 0.03996291422520886755278730528
Y real8 0.5
Xlog_2 real8 0.30102999566398119521 ; log(2.0)
Result real8 0.0
.code
movsd xmm0,real8 ptr [Xlog_2]
mulsd xmm0,real8 ptr [Y]
movsd xmm1,real8 ptr [exp4]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp3]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp2]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp1]
mulsd xmm0,xmm1
addsd xmm0,real8 ptr [exp0]
movsd Result,xmm0 ; result = 1.161847999603
ret
exp(y * ln(x)) is equal to x^y (ln is log base e...). I calculated exp(0.5 * ln(2.0)) and it is equal to 1.414213562373, as expected.
Quote from: Siekmanski: Am I right, that pow(x,y) is the same as exp(y * log(x)) ?
Close! But you've mixed up log (base 10) and ln (base e).
The relevant math goes like this,
let f = x^y
ln f = y ln x ; take ln of both sides
f = e ^ (y ln x) ; take exp of both sides
So, pow(x,y) = x^y = e ^ (y ln x). (as gelatine1 says)
OR ELSE,
let f = x^y
log f = y log x ; take log of both sides
f = 10 ^ (y log x) ; take power-of-10 of both sides
So, pow(x,y) = x^y = 10 ^ (y log x)
Check it, both formulas give 1.41421...
10 was used in the old days when we looked answers up in log tables.
Your exp routine is using a set of coefficients close to the standard Taylor series, but not the same; I guess those are Chebyshev coefficients? The Taylor series, of course, is
e^a = 1 + a + a^2/2! + a^3/3! + a^4/4! + ...
So all you need to make that a full "pow" routine is the ln series, plus some manipulation to scale by powers of 2 so that the series converges. (With computers we use powers of 2, of course; in the old days it would have been powers of 10.)
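The Taylor series above can be summed directly; a minimal C sketch (my own, not from any poster's library), fine for small |a|, which is exactly what the powers-of-2 scaling guarantees:

```c
/* Sum the Taylor series e^a = 1 + a + a^2/2! + a^3/3! + ... directly.
   30 terms is far more than enough for |a| < 1. */
static double exp_taylor(double a)
{
    double term = 1.0, sum = 1.0;
    for (int n = 1; n < 30; n++) {
        term *= a / n;   /* builds a^n / n! incrementally */
        sum += term;
    }
    return sum;
}
```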
x^y = log^-1(y*log(x))
doesn't matter which base you use, so long as it's the same base for log^-1 (antilog) and log
for the FPU, it's base 2, of course
the FY2L2X instruction performs the y*log(x) part in a single step
it's the antilog that's a pain - lol
it's not that bad, really
it just looks bad because you have to work on the fraction and exponent separately
Quote from: dedndave: the FY2L2X instruction performs the y*log(x) part in a single step
- that's fyl2x (only one 2). It's the log, base 2 of course
Quote from: dedndave: it's the antilog that's a pain - lol
- f2xm1 does the antilog (or exp) - base 2 of course
Quote from: dedndave: it's not that bad, really
it just looks bad
- ? Don't get it. What looks bad about it? The hardest part is remembering the names of the instructions! I had to check to make sure there's only one "2" in fyl2x :biggrin:
- When computed with a power series, both functions require scaling: by 2 in the computer case, by 10 in the good old days. The reason is that the respective power series must converge. No doubt the new instructions use series internally, and f2xm1 requires input less than 1. However, fyl2x does not; apparently they do the scaling internally for us, which is nice. (Oddly, fyl2xp1 does limit the input X - according to the masm32 FPU tute, anyway.)
- So in that sense you're right, f2xm1 looks more "bad". But for speed we can't use these instructions anyway (AFAIK), so both operations require scaling. That amounts to using the mantissa in the series and adjusting the exponent by shifting by powers of 2. That's what fscale does - but again, according to Agner, for speed it should be avoided also
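Agner's exponent-bit trick mentioned above can be sketched in C (assumptions: IEEE-754 double, and the result stays normal - this is what fscale does, minus all the special cases):

```c
#include <stdint.h>
#include <string.h>

/* Scale a double by 2^n by adding n directly to the IEEE-754 exponent
   field instead of calling fscale/ldexp. No overflow, underflow or
   denormal handling - the result must stay a normal number. */
static double scale_by_pow2(double x, int n)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += (uint64_t)(int64_t)n << 52;  /* exponent field starts at bit 52 */
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```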
@gelatine1,
Yes, I've mixed up log (base 10) and ln (base e).
It works with a natural logarithm.
ln(2.0) = 0.69314718056
Next step is a fast natural logarithm routine...
@rrr314159,
Those are Chebyshev coefficients,
http://lolengine.net/wiki/doc/maths/remez/tutorial-relative-error
Quote from: Siekmanski on October 07, 2015, 03:24:06 PM
Those are Chebyshev coefficients,
http://lolengine.net/wiki/doc/maths/remez/tutorial-relative-error
Quote:
double fastexp2(double x)
{
const double a0 = 9.996278957172137756013002477405568253404e-1;
const double a1 = 9.979387291070364307450237392251445288594e-1;
const double a2 = 5.028986508540491482566165849167001609232e-1;
const double a3 = 1.764862321902469630550896534647276819692e-1;
const double a4 = 3.996291422520886755278730528311966596433e-2;
return a0 + x * (a1 + x * (a2 + x * (a3 + x * a4)));
}
Somebody should enlighten these guys about the physical limits of a double aka REAL8 :P
The logic of the Fast Sinus() and Cosinus() algos (http://masm32.com/board/index.php?topic=4580.0) could be used, but the problem is the range. For Sinus & Cosinus, periodicity allows limiting the lookup table, but logs have no limits in principle. Of course, if there were a concrete application with a known range, one could cheat... The potential gain is high; the logs are even slower than fsin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
8479 cycles for 100 * fsin
12385 cycles for 100 * lognat
1609 cycles for 100 * fast Sinus
Still looking for a faster natural log routine... :biggrin:
.data
X real8 2.0
Y real8 0.5
log_2 real8 0.69314718055994530941 ; ln(2.0)
exp0 real8 0.99962789571721377560
exp1 real8 0.99793872910703643074
exp2 real8 0.50289865085404914825
exp3 real8 0.17648623219024696305
exp4 real8 0.03996291422520886755
.code
;pow(X,Y) == exp(Y * ln(X))
fld real8 ptr [log_2]
fld real8 ptr [X]
fyl2x ; slow....
fstp real8 ptr [esp] ; ln(X)
movsd xmm0,real8 ptr [esp]
mulsd xmm0,real8 ptr [Y]
movsd xmm1,real8 ptr [exp4]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp3]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp2]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp1]
mulsd xmm0,xmm1
addsd xmm0,real8 ptr [exp0]
;result in xmm0
This is the basic idea, very simple Taylor series:
let z = x - 1, then
ln x = ln(1+z) = z - z^2/2 + z^3/3 - z^4/4 + z^5/5 - z^6/6 ...
It converges pretty well. Chebyshev-type coefficients are no doubt a bit better, but this is good enough.
Of course abs val z must be < 1, so use the mantissa in this formula. Then you just add the exponent.
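That mantissa/exponent plan can be sketched in C (ln_series is a hypothetical name, plain Taylor with no Chebyshev tuning); frexp does the mantissa/exponent split that the assembler routine would do by hand:

```c
#include <math.h>

/* ln via the series above: frexp splits x into mantissa m in [0.5, 1)
   and exponent e, so ln(x) = ln(m) + e*ln(2). The series is applied to
   z = m - 1, which lies safely in [-0.5, 0). Assumes x > 0. */
static double ln_series(double x)
{
    int e;
    double m = frexp(x, &e);        /* x = m * 2^e, 0.5 <= m < 1 */
    double z = m - 1.0, zn = z, sum = 0.0;
    for (int n = 1; n <= 40; n++) { /* z - z^2/2 + z^3/3 - z^4/4 ... */
        sum += (n & 1 ? zn : -zn) / n;
        zn *= z;
    }
    return sum + e * 0.69314718055994530942;  /* add e * ln(2) */
}
```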
These days I'm trying to take advantage of the last few days of autumn; soon I'll be snow-bound and will write a ln routine, assuming it's still wanted. Maybe I'll get around to it pretty quickly
I also have other things to do first, but later I'll try the Chebyshev polynomial algorithm to create a natural log routine.
Quote from: rrr: It converges pretty well
- You know, actually the ln series won't converge too well. Power series convergence depends on the size of the coefficients - you want them small - but ln has rather large coefficients, namely 1, -1/2, 1/3, -1/4, 1/5, -1/6, etc. Compare to exp, where the coefficients are inverse factorials, so instead of 1/4 (0.25) you have 1/24 (0.041666...). Sin and cos are really good, because every 2nd coefficient is 0. So don't expect it to be as easy as it was with them. Chebyshev - same story; they're merely somewhat more accurate adjustments of Taylor. The only good thing about ln is the alternating signs (true also for sin and cos)
Quote from: jj: the logs are even slower than fsin
- The above explains why ...
Quote from: jj: problem is the range ... logs have no limits in principle
- That's not a problem, actually. The log (ln) series inputs must be limited anyway, for the series to even dream of converging.
Quote from: rrr: Of course abs val z must be < 1, so use the mantissa in this formula. Then you just add the exponent.
- Remember log tables, in the good old days? For a number like, say, 12345, you look up the log of 1.2345 (the mantissa). Then you add 4 to it, because there are 4 powers of 10. With computers the same idea applies, but powers of 2 are used. It's very easy because the log of 10^n is simply n (the exponent).
Quote from: jj: Somebody should enlighten these guys about the physical limits of a double aka REAL8 :P
- The rubes are very impressed when they see
const double a2 = 5.028986508540491482566165849167001609232e-1; //!
- It's even got a semicolon at the end - wow, this guy knows what he's doing! Compare:
a2 real8 5.0289865085404914
- I'm falling asleep just looking at it. BTW, in this case it should end in 5: round up because of the next digit, 8. Actually it's more complicated; for that last digit you have to express the value to the closest precision as a binary number, so it may be best to use 4 (? not sure about that)
- Other than PR, there are a couple of excuses for using all the digits. Someday you may want to change it to REAL10 or even larger; might as well have them all there for you. Think of the unused digits as a comment. Also ... I once had the impression the compiler actually looked one digit farther and took care of rounding; that was a long time ago, with a different compiler (C), but I've retained the habit of keeping at least one extra digit
Quote from: Siekmanski: I also have other things to do first, but later I'll try the Chebyshev polynomial algorithm to create a natural log routine.
- Me too ... there's another problem with the ln routine (and pow(X,Y)): neither of us needs it! We both had a real need for sin/cos in our graphics work. In fact only jj2007 really wants faster pow, for MasmBasic. Maybe mabdelouahab too, but he hasn't mentioned a need for speed. Of course I'm happy to help make MasmBasic as fast as possible, and I'm sure you are too (no, I'm not being sarcastic), plus there's the challenge of the thing; but it is, no doubt, easier to put off than the trig functions. My MathArt, which you may remember was posted in Workshop, was almost twice as fast (with the "sin smoothing" option) due to our trig efforts.
Finally it's worth mentioning - and this may be the really fun part - there may be faster ways to tackle pow. The obvious thing is using the standard algo with a fast exp and a fast ln; but it's possible to glom them together into one (2-valued) power series, if all you want is the final result (pow). May or may not help.
There may be alternatives to the standard power series also. You know, for exp (ln also, but I don't remember those) there are series that converge much faster; but they're not power series, and implementing them (on a computer) is typically so much messier that they're not worth the trouble. Anyway (note to self:) browse the net a bit looking for tips like this. Find out how Mathematica, MatLab etc. do it. When you get a round Tuit ...
Quote from: rrr314159 on October 08, 2015, 08:34:51 AM: In fact only jj2007 really wants faster pow, for MasmBasic
:biggrin:
But I have a couple of other things to solve, and ExpXY is fast enough already. Plus, the gliding SetPoly3 (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1124) thing won't work as smoothly as on the Sinus().
Quote from: jj2007: ExpXY is fast enough
- there's no such thing as "fast enough" 8)
Quote from: rrr314159 on October 08, 2015, 08:34:51 AM:
- The rubes are very impressed when they see
const double a2 = 5.028986508540491482566165849167001609232e-1; //!
- It's even got a semicolon at the end - wow this guy knows what he's doing! Compare
a2 real8 5.0289865085404914
Reminds me of this nice example found in a book:
We convert from base 10 to base 2 with infinite precision, then round the result to the nearest (half to even) REAL4-precision (24-bit) value:
x1 = 1 + 2^-24 = 1.000000059604644775390625 (this number is the midpoint between 1 and its floating-point successor 1 + 2^-23)
x2 = 1 + 2^-24 + 10^-60 = 1.000000059604644775390625000000000000000000000000000000000001
A correct conversion routine* should return 1 (03F800000r) for x1, and the successor of 1 for x2 (= 1 + 2^-23 = 03F800001r)
* e.g. as described in the ISO C99 or IEEE 754-2008 standard
regards
So ISO C99 says that at the midpoint you round to even - in this case 3F800000 (down), not 3F800001. If it's above, no matter how little (10^-60, or 10^-60000), round up. So a compiler should take account of all the digits in your data statement (e.g. 5.028986508540491482566165849167001609232e-1, or 1.000000059604644775390625000000000000000000000000000000000001) to provide the correct rounding, even though of course they don't appear in the actual value your program will use. That's what I had thought. I wonder if ml.exe and/or JWasm does that. Easy enough to check it out ...
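The CRT side of it is easy to check in C (assuming a correctly rounding strtof, as e.g. glibc provides; float_bits is just an illustrative helper):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* strtof on a conforming C99 / IEEE-754 implementation converts decimal
   strings with correct rounding, however many digits follow.
   float_bits returns the raw REAL4 bit pattern of the result. */
static uint32_t float_bits(const char *s)
{
    float f = strtof(s, NULL);
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
/* The exact midpoint 1 + 2^-24 should round to even, i.e. to 1.0f
   (0x3F800000); a hair above it should round up to 0x3F800001. */
```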
Sorry, saying that C99 describes such a conversion was not right. For FP literals, the standard says the conversion should be identical to the runtime conversion (with the default runtime rounding mode). The number of significant digits is limited (DECIMAL_DIG in float.h, e.g. 21 for gcc). The standard then says:
Quote: If the subject sequence has the decimal form and at most DECIMAL_DIG significant digits, the result should be correctly rounded. If the subject sequence D has the decimal form and more than DECIMAL_DIG significant digits, consider the two bounding, adjacent decimal strings L and U, both having DECIMAL_DIG significant digits, such that the values of L, D, and U satisfy L ≤ D ≤ U.
The result should be one of the (equal or adjacent) values that would be obtained by correctly rounding L and U according to the current rounding direction, with the extra stipulation that the error with respect to D should have a correct sign for the current rounding direction
That is to say, the IEEE standard allows - but does not require - a limitation of the number of significant digits.