FPU power

Siekmanski · October 07, 2015, 03:00:11 AM

Am I right, that pow(x,y) is the same as exp(y * log(x)) ?

For example: pow(2.0, 0.5) = 1.414213562373 as calculated in the routines from Dave and JJ2007.
But if I calculate: exp(0.5 * log(2.0)).
I get 1.16243273894 as a result on my calculator.

log(2.0) = 0.30102999566398119521
exp(0.5 * 0.30102999566398119521) = 1.16243273894

Below a fast Exp approximation routine. ( the log() value is hard coded, needs a fast log routine..... )
It's not exact but it's fast,

Code Select


.data
exp0 real8	0.99962789571721377560130024774
exp1 real8	0.99793872910703643074502373922
exp2 real8	0.50289865085404914825661658491
exp3 real8	0.17648623219024696305508965346
exp4 real8	0.03996291422520886755278730528


Y	real8 	0.5
Xlog_2 	real8	0.30102999566398119521	; log(2.0)
Result	real8	0.0

.code
	movsd	xmm0,real8 ptr [Xlog_2]
	mulsd	xmm0,real8 ptr [Y]

	movsd	xmm1,real8 ptr [exp4]
	mulsd	xmm1,xmm0
	addsd	xmm1,real8 ptr [exp3]
	mulsd	xmm1,xmm0
	addsd	xmm1,real8 ptr [exp2]
	mulsd	xmm1,xmm0
	addsd	xmm1,real8 ptr [exp1]
	mulsd	xmm0,xmm1
	addsd	xmm0,real8 ptr [exp0]
	movsd	Result,xmm0	; result = 1.161847999603
	ret

gelatine1 · October 07, 2015, 05:15:59 AM

exp(y * ln(x)) is equal to x^y. (ln is log with base e...). I calculated exp(0.5 * ln(2.0)) and it is equal to 1.414213562373 as expected.

rrr314159 · October 07, 2015, 05:50:16 AM

Quote from: SiekmanskiAm I right, that pow(x,y) is the same as exp(y * log(x)) ?

Close! But you've mixed up log (base 10) and ln (base e).

The relevant math goes like this,

let f = x^y
ln f = y ln x ; take ln of both sides
f = e ^ (y ln x) ; take exp of both sides

So, pow(x,y) = x^y = e ^ (y ln x). (as gelatine1 says)

OR ELSE,

let f = x^y
log f = y log x ; take log of both sides
f = 10 ^ (y log x) ; take power-of-10 of both sides

So, pow(x,y) = x^y = 10 ^ (y log x)

Check it, both formulas give 1.41421...

10 was used in the old days when we looked answers up in log tables.

Your exp routine is using a set of coefficients close to the standard Taylor series but not the same; I guess those are Chebyshev coefficients? The Taylor series, of course, is

e ^ a = 1 + a + a^2/2 + a^3/3! + a^4/4! + ...

So all you need to make that a full "pow" routine is the ln series, plus some manipulation to scale by powers of 2 so that the series converge. (With computers of course we use powers of 2, in the old days would be powers of 10).

dedndave · October 07, 2015, 07:00:20 AM

x^y = log_-1(y*log(x))

doesn't matter which base you use, so long as it's the same base for log_-1 (antilog) and log
for the FPU, it's base 2, of course

dedndave · October 07, 2015, 07:07:13 AM

the FY2L2X instruction performs the y*log(x) part in a single step

it's the antilog that's a pain - lol
it's not that bad, really
it just looks bad because you have to work on the fraction and exponent seperately

rrr314159 · October 07, 2015, 08:39:36 AM

Quote from: dedndavethe FY2L2X instruction performs the y*log(x) part in a single step

- that's fyl2x (only one 2). It's the log base 2, of course

Quote from: dedndaveit's the antilog that's a pain - lol

- f2xm1 does the antilog (or exp) - base 2 of course

Quote from: dedndaveit's not that bad, really
it just looks bad

- ? Don't get it. What looks bad about it? hardest part is remembering the name of the instructions! I had to check make sure there's only one "2" in fyl2x

- When computed with a power series, both functions require scaling; by 2, in the computer case; by 10, in the good old days. The reason is, the respective power series must converge. No doubt the new instructions use, internally, series; and f2xm1 requires input less than 1. However fyl2x does not; apparently they do the scaling internally for us, which is nice. (Oddly fyl2xp1 does limit the input, X! -according to masm32 fpu tute, anyway.)

- So in that sense you're right, f2xm1 looks more "bad". But for speed, we can't use these instructions anyway (AFAIK) so both operations require scaling. Amounts to using the mantissa in the series, and adjusting the exponent by shifting by powers of 2. That's what fscale does, but again, according to Agner, for speed it should be avoided also

Siekmanski · October 07, 2015, 03:24:06 PM

@gelatine1,
Yes, I've mixed up log (base 10) and ln (base e).

It works with a natural logarithm.
ln(2.0) = 0.69314718056

Next step is a fast natural logarithm routine...

@rrr314159,
Those are Chebyshev coefficients,
http://lolengine.net/wiki/doc/maths/remez/tutorial-relative-error

jj2007 · October 07, 2015, 06:18:14 PM

Quote from: Siekmanski on October 07, 2015, 03:24:06 PM
Those are Chebyshev coefficients,
http://lolengine.net/wiki/doc/maths/remez/tutorial-relative-error

Quotedouble fastexp2(double x)
{
const double a0 = 9.996278957172137756013002477405568253404e-1;
const double a1 = 9.979387291070364307450237392251445288594e-1;
const double a2 = 5.028986508540491482566165849167001609232e-1;
const double a3 = 1.764862321902469630550896534647276819692e-1;
const double a4 = 3.996291422520886755278730528311966596433e-2;

return a0 + x * (a1 + x * (a2 + x * (a3 + x * a4)));
}

Somebody should enlighten these guys about the physical limits of a double aka REAL8 :P

The logic of the Fast Sinus() and Cosinus() algos could be used, but problem is the range. For Sinus & Cosinus, repetition allows to limit the lookup table, but logs have no limits in principle. Of course, if there was a concrete application with a known range, one could cheat... the potential gain is high, the logs are even slower than fsin:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
8479    cycles for 100 * fsin
12385   cycles for 100 * lognat
1609    cycles for 100 * fast Sinus

Siekmanski · October 07, 2015, 06:20:41 PM

Still looking for a faster natural log routine...

Code Select

.data
X 	real8 2.0
Y	real8 0.5
log_2	real8 0.69314718055994530941	; log(2.0)
exp0 	real8 0.99962789571721377560
exp1 	real8 0.99793872910703643074
exp2 	real8 0.50289865085404914825
exp3 	real8 0.17648623219024696305
exp4 	real8 0.03996291422520886755

.code

;pow(X,Y) == exp(Y * ln(X)) 

	fld real8 ptr [log_2]
	fld real8 ptr [X]
	fyl2x  ; slow....
	fstp real8 ptr [esp]	; ln(X)

	movsd	xmm0,real8 ptr [esp]
	mulsd	xmm0,real8 ptr [Y]
	movsd	xmm1,real8 ptr [exp4]
	mulsd	xmm1,xmm0
	addsd	xmm1,real8 ptr [exp3]
	mulsd	xmm1,xmm0
	addsd	xmm1,real8 ptr [exp2]
	mulsd	xmm1,xmm0
	addsd	xmm1,real8 ptr [exp1]
	mulsd	xmm0,xmm1
	addsd	xmm0,real8 ptr [exp0]
	
	;result in xmm0

rrr314159 · October 08, 2015, 12:58:41 AM

This is the basic idea, very simple Taylor series:

let z = x - 1, then

ln x = ln(1+z) = z - z^2/2 + z^3/3 - z^4/4 + z^5/5 - z^6/6 ...

It converges pretty well. Chebyshev - type coefficients no doubt a bit better, but this is good enough.

Of course abs val z must be < 1, so use the mantissa in this formula. Then you just add the exponent.

These days I'm trying to take advantage of the last few days of autumn, soon I'll be snow-bound and will write a ln routine, assuming it's still wanted. Maybe get around to it pretty quick

Siekmanski · October 08, 2015, 01:25:37 AM

First I have also other things to do but later I'll try the Chebyshev polynomial algorithm to create a natural log routine.

rrr314159 · October 08, 2015, 08:34:51 AM

Quote from: rrrIt converges pretty well

- u know, actually the ln series won't converge too well. Power series convergence depends on the size of the coefficients, you want them small, but ln has rather large coeff's. Namely, 1, -.5, .333...,-.25, .2, -.1666... etc. Compare to exp where coeff's are inverse of factorials, so instead of 1/4 (.25) you have 1/24 (.041666...). Sin and cos are really good, because every 2nd coeff is 0. So don't expect it to be as easy as it was with them. Chebyshev - same story, they're merely rather more accurate adjustments of Taylor. The only good thing about ln is the alternating signs (true also for sin and cos)

Quote from: jjthe logs are even slower than fsin

- The above explains why ...

Quote from: jjproblem is the range ... logs have no limits in principle

- That's not a problem, actually. The log (ln) series inputs must be limited anyway, for the series to even dream of converging.

Quote from: rrrOf course abs val z must be < 1, so use the mantissa in this formula. Then you just add the exponent.

- Remember log tables, in the good old days? For a number like, say, 12345, you look up log of 1.2345 (the mantissa). Then you add 4 to it, because there are 4 powers of 10. With computers same idea but powers of 2 are used. It's very easy because the log of 10^n is simply n (the exponent).

Quote from: jjSomebody should enlighten these guys about the physical limits of a double aka REAL8

- The rubes are very impressed when they see

Code Select

const double a2 = 5.028986508540491482566165849167001609232e-1; //!
- It's even got a semicolon at the end - wow this guy knows what he's doing! Compare

Code Select

a2 real8 5.0289865085404914

- I'm falling asleep just looking at it. BTW in this case it should end in 5; round up for the next digit, 8. Actually it's more complicated, for that last digit you have to express it to the closest precision as a binary number so it may be best to use 4 (? not sure about that)

- Other than PR there are a couple excuses for using all the digits. Someday you may want to change it to real10 or even larger, might as well have them all there for you. Think of the unused digits as a comment. Also ... I once had the impression the compiler actually looked one digit farther and took care of rounding; that was a long time ago, different compiler (C), but I've retained the habit of retaining at least one extra digit

Quote from: SiekmanskiFirst I have also other things to do but later I'll try the Chebyshev polynomial algorithm to create a natural log routine.

- Me too ... there's another problem with ln routine (and, pow(X,Y)) - neither of us need it! We both had a real need for sin / cos in our graphics work. In fact only jj2007 really wants faster pow, for MasmBasic. Maybe mabdelouahab also, but he hasn't mentioned need for speed. Of course I'm happy to help make MasmBasic as fast as possible, and I'm sure you are too (no, I'm not being sarcastic), plus there's the challenge of the thing, but it is, no doubt, easier to put off than trig functions. My MathArt, which you may remember posted in Workshop, was almost twice as fast (with "sin smoothing" option) due to our trig efforts.

Finally it's worth mentioning - and this may be the really fun part - there may be faster ways to tackle pow. The obvious thing is using the standard algo with a fast exp and fast ln; but it's possible to glom them together into one (2-valued) power series, if all you want is the final result (pow). May or may not help

There may be alternatives to the standard power series also. You know, for exp (ln also, but I don't remember those) there are series that converge much faster; but they're not power series and implementation (on computer) is typically so much messier that they're not worth the trouble. Anyway, (note to self:) browse the net a bit looking for tips like this. Find out how Mathematica, MathLab etc does it. When you get a round Tuit ...

jj2007 · October 08, 2015, 10:10:48 AM

Quote from: rrr314159 on October 08, 2015, 08:34:51 AMIn fact only jj2007 really wants faster pow, for MasmBasic

But I have a couple of other things to solve, and ExpXY is fast enough already. Plus, the gliding SetPoly3 thing won't work as smoothly as on the Sinus().

rrr314159 · October 08, 2015, 11:23:54 AM

Quote from: jj2007ExpXY is fast enough

- there's no such thing as "fast enough" 8)

qWord · October 10, 2015, 12:57:06 PM

Quote from: rrr314159 on October 08, 2015, 08:34:51 AM- The rubes are very impressed when they see

Code Select Expand
const double a2 = 5.028986508540491482566165849167001609232e-1; //!
- It's even got a semicolon at the end - wow this guy knows what he's doing! Compare

Code Select Expand
a2 real8 5.0289865085404914

Reminds me of this nice example found in a book:
We are doing the conversion of base 10 to 2 with infinite precision and then round the result to nearest (half to even) REAL4-precision (24 bits) value:
x₁ = 1 + 2^-24 = 1.000000059604644775390625 (this number is the midpoint between 1 and it's floating point successor 1+2^-23)
x₂ = 1 + 2^-24 + 10^-60 = 1.000000059604644775390625000000000000000000000000000000000001
A correct conversion routine* should return 1 (03F800000r) for x₁ and the successor number of 1 for x₂ (=1+2^-23 = 03F800001r)

*e.g. as described in ISO C99 or IEEE754-2008 standart

regards

The MASM Forum

News:

FPU power

Siekmanski

gelatine1

rrr314159

dedndave

dedndave

rrr314159

Siekmanski

jj2007

Siekmanski

rrr314159

Siekmanski

rrr314159

jj2007

rrr314159

qWord