I am trying to find an FPU power function (x^y)
I believe FYL2X, FYL2XP1 and F2XM1 could help you out. http://www.ray.masmcode.com/tutorial/appen1.htm
That is because x^y = 2^(y*log2(x)). I am not sure whether there is a better solution.
this is Ray's code, put into PROC form for REAL10's
f10Exp PROTO :LPVOID,:LPVOID,:LPVOID
;***********************************************************************************************
OPTION PROLOGUE:None
OPTION EPILOGUE:None
f10Exp PROC lpResult:LPVOID,lpBase:LPVOID,lpExponent:LPVOID
;0.0 raised to any exponent returns an indefinite qNaN
;any non-zero base raised to 0.0 returns 1.0
mov ecx,[esp+12] ;ECX = lpExponent
mov edx,[esp+8] ;EDX = lpBase
mov eax,[esp+4] ;EAX = lpResult
fld real10 ptr [ecx]
fld real10 ptr [edx]
fyl2x
fld st
frndint
fsub st(1),st
fxch st(1)
f2xm1
fld1
fadd
fscale
fstp st(1)
fstp real10 ptr [eax]
ret 12
f10Exp ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;***********************************************************************************************
Cool ... Ray's original is in \masm32\fpulib\FpuXexpY.asm, with other good stuff. Your version is a useful modification, easier to understand without his extensive input checking
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
Init
PrintLine "exact:", Tb$, "1.41421356237309504880..." ; Wolfram Alpha (http://www.wolframalpha.com/input/?i=2%5E0.5)
Inkey Str$("2^0.5=\t%Jf", ExpXY(2, 0.5)) ; more (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1192)
EndOfCode
Output:
exact: 1.41421356237309504880...
2^0.5= 1.414213562373095049
@jj2007,
Should have known MasmBasic has an exp function! All it needs now is a "kitch*n sink" function :P
P.S. If you spell the word "kitch*n" with the e, this forum software substitutes "cuisine". I, and others, have noted this strange interpolation before
Did you know that the famous cuisine problem (http://masm32.com/board/index.php?topic=4566.msg49025#msg49025) is a feature, not a bug?
:P
Another one,
fld FLT4(0.5)
fld FLT4(2.0)
fyl2x
fld1
fld st(1)
fprem
f2xm1
faddp
fscale
fstp result ; result = 1.414213562373
fstp st(0)
Quote from: jj2007: Did you know that the famous cuisine problem is a feature, not a bug?
- No I didn't - and still don't :eusa_snooty: Some bored web-forum-software programmer's idea of a joke, apparently :biggrin: But it's nice to get it straight finally; hope no other words are affected
@siekmanski,
*** Nit Pick Alert ***
fprem is slow; perhaps that's why Raymond uses frndint and fsub instead. On the other hand, probably not, since speed is not a major concern for him in this library (as he states somewhere). fscale is also slow (per Agner Fog); he recommends shifting the float's exponent bits directly, but in this case the algo would have to be changed, since that must be done to a float in memory. Possibly it's best to multiply by the appropriate power of 2; but actually the best course is to ignore this nit entirely - unless speed is very important
Yes, on most CPUs frndint and fsub is faster (just checked Agner Fog). On AMD, fprem is a little bit faster than frndint.
Thanks, that's useful to know; too bad AMD is so different speed-wise. Another important example is fsin. Fast routines really should have different versions for Intel and AMD. Almost makes you want to use a (good) optimizing C++ compiler
Yeah, it's very difficult to write one "fastest routine" for all cpu's.
Man ... :icon_eek: All these instructions! I thought it was just a single instruction like "EXP"; I'm looking for a shorthand
the "shorthand" you are looking for is in reply #2 :biggrin:
that is, put it in a PROC, then use it to do the work
.DATA
r10Base REAL10 2.0
r10Exp REAL10 3.0
.DATA?
r10Result REAL10 ?
.CODE
INVOKE f10Exp,offset r10Result,offset r10Base,offset r10Exp
when you're done, r10Result will be 8.0
Ok Dave Thank you, :biggrin:
I've been using REAL8. Should I use the conversion of REAL10 to REAL8? Also, how do I print a REAL8 with printf?
Print REAL8 variables with, for instance, "%.4g"; there are many other format options available as well. Both REAL4 and REAL10 are not so easy. If you don't care about the extra precision of REAL10, simplest would be to stick with REAL8 - just convert all the 10's to 8's in dedndave's code
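In C terms a REAL8 is a double, and the CRT formats it like this (a minimal sketch; formats_as is just an illustrative helper, not a CRT function):

```c
#include <stdio.h>
#include <string.h>

/* Formatting a REAL8 (a C double) with the CRT's printf-style format
   strings, as suggested above. "%.4g" prints 4 significant digits. */
static int formats_as(double v, const char *fmt, const char *expect)
{
    char buf[64];
    snprintf(buf, sizeof buf, fmt, v);
    return strcmp(buf, expect) == 0;
}
/* In a program you would simply write:
     printf("%.4g\n", r8);    four significant digits
     printf("%.15f\n", r8);   fixed-point, 15 decimals
     printf("%e\n", r8);      scientific notation          */
```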
Thank you rrr314159, it works :t
You're welcome; it's a privilege to help such a hard worker as yourself
@Siekmanski et al,
It occurred to me that fyl2x and f2xm1 are similar to fsin: they use a power series to compute the function. And fsin is incredibly slow on Intel - it can be beaten by a factor of 100, as we found out in the Trigonometry thread in the Laboratory. Checked Agner; it turns out these are also very slow (on Prescott particularly, but it probably applies to modern Intels). So it should be easy to go much faster. Essentially that would entail going back to the way we did it in the good old days: roll-your-own power series. A PITA, but worth it if you need speed. If I get around to it I'll start a Laboratory thread on the topic. Of course at my age that might take a year; anyone else is welcome to look into it
the REAL10 proc can be easily modified to REAL8
i use REAL10's because that's the "native" size for the FPU, internally
notice that, when you load REAL8's into the FPU, it converts them to REAL10's internally
hopefully, the result is in range when it saves the result to a REAL8 when done :P
f8Exp PROTO :LPVOID,:LPVOID,:LPVOID
;***********************************************************************************************
OPTION PROLOGUE:None
OPTION EPILOGUE:None
f8Exp PROC lpResult:LPVOID,lpBase:LPVOID,lpExponent:LPVOID
;0.0 raised to any exponent returns an indefinite qNaN
;any non-zero base raised to 0.0 returns 1.0
mov ecx,[esp+12] ;ECX = lpExponent
mov edx,[esp+8] ;EDX = lpBase
mov eax,[esp+4] ;EAX = lpResult
fld real8 ptr [ecx]
fld real8 ptr [edx]
fyl2x
fld st
frndint
fsub st(1),st
fxch st(1)
f2xm1
fld1
fadd
fscale
fstp st(1)
fstp real8 ptr [eax]
ret 12
f8Exp ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;***********************************************************************************************
It would be a nice challenge to come up with a faster power function.
Wish I had more time to give it a try.....
Quote from: Siekmanski on October 06, 2015, 08:28:12 PM
It would be a nice challenge to come up with a faster power function.
Testbed attached :biggrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
24947 cycles for 100 * f8Exp (Dave)
12338 cycles for 100 * ExpXY (MasmBasic)
25097 cycles for 100 * pow (CRT)
24944 cycles for 100 * f8Exp (Dave)
12329 cycles for 100 * ExpXY (MasmBasic)
24983 cycles for 100 * pow (CRT)
24959 cycles for 100 * f8Exp (Dave)
12298 cycles for 100 * ExpXY (MasmBasic)
24983 cycles for 100 * pow (CRT)
24947 cycles for 100 * f8Exp (Dave)
12345 cycles for 100 * ExpXY (MasmBasic)
24990 cycles for 100 * pow (CRT)
24957 cycles for 100 * f8Exp (Dave)
12328 cycles for 100 * ExpXY (MasmBasic)
24972 cycles for 100 * pow (CRT)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)
20886 cycles for 100 * f8Exp (Dave, 2^0.5)
18506 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30764 cycles for 100 * pow (CRT, 4^0.5)
20947 cycles for 100 * f8Exp (Dave, 2^0.5)
18486 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
31182 cycles for 100 * pow (CRT, 4^0.5)
21028 cycles for 100 * f8Exp (Dave, 2^0.5)
18480 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30621 cycles for 100 * pow (CRT, 4^0.5)
21278 cycles for 100 * f8Exp (Dave, 2^0.5)
18486 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30769 cycles for 100 * pow (CRT, 4^0.5)
20891 cycles for 100 * f8Exp (Dave, 2^0.5)
18572 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30608 cycles for 100 * pow (CRT, 4^0.5)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)
25147 cycles for 100 * f8Exp (Dave, 2^0.5)
12421 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25152 cycles for 100 * pow (CRT, 4^0.5)
25197 cycles for 100 * f8Exp (Dave, 2^0.5)
12404 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25155 cycles for 100 * pow (CRT, 4^0.5)
25099 cycles for 100 * f8Exp (Dave, 2^0.5)
12393 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25162 cycles for 100 * pow (CRT, 4^0.5)
25092 cycles for 100 * f8Exp (Dave, 2^0.5)
12388 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25194 cycles for 100 * pow (CRT, 4^0.5)
25115 cycles for 100 * f8Exp (Dave, 2^0.5)
12417 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25116 cycles for 100 * pow (CRT, 4^0.5)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
25032 cycles for 100 * f8Exp (Dave, 2^0.5)
12318 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25215 cycles for 100 * pow (CRT, 4^0.5)
25034 cycles for 100 * f8Exp (Dave, 2^0.5)
12314 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25169 cycles for 100 * pow (CRT, 4^0.5)
25055 cycles for 100 * f8Exp (Dave, 2^0.5)
12332 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25148 cycles for 100 * pow (CRT, 4^0.5)
25066 cycles for 100 * f8Exp (Dave, 2^0.5)
12312 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25172 cycles for 100 * pow (CRT, 4^0.5)
25048 cycles for 100 * f8Exp (Dave, 2^0.5)
12302 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25132 cycles for 100 * pow (CRT, 4^0.5)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
--- ok ---
Thanks jj,
Obvious question: why is ExpXY faster? Aren't you (MasmBasic) using the same instructions - fyl2x and f2xm1, fprem or frndint, and fscale? If instead you're using a traditional algo (pre-fyl2x and f2xm1), that may explain it.
Another obvious one: why different test cases for the three "Units Under Test" - square roots of 2, 3, and 4? Judging by the numbers it's not a problem. For some implementations 4^0.5 might go a lot faster, but obviously not with CRT pow. I guess we can suppose any inputs will go at the same speed.
Evidently ExpXY gets less of an advantage (though still ahead) on AMD, which (judging by fsin) may have much better implementations of fyl2x and f2xm1. That would make sense if you're using a traditional algo which implements its own power series (for ln and exp)
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
18776 cycles for 100 * f8Exp (Dave, 2^0.5)
17133 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22233 cycles for 100 * pow (CRT, 4^0.5)
18549 cycles for 100 * f8Exp (Dave, 2^0.5)
17595 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22059 cycles for 100 * pow (CRT, 4^0.5)
18576 cycles for 100 * f8Exp (Dave, 2^0.5)
17082 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22169 cycles for 100 * pow (CRT, 4^0.5)
18787 cycles for 100 * f8Exp (Dave, 2^0.5)
17142 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22370 cycles for 100 * pow (CRT, 4^0.5)
18554 cycles for 100 * f8Exp (Dave, 2^0.5)
17172 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22242 cycles for 100 * pow (CRT, 4^0.5)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
--- ok ---
Quote from: rrr314159 on October 07, 2015, 12:31:15 AM: I guess we can suppose any inputs will go at the same speed.
That's an interesting hypothesis (and it was mine, too), but unfortunately it is strongly at odds with reality 8)
So ... wouldn't it be better to test them with the same inputs? Not that I want to make extra work for you. Only 4^0.5 ought to go faster, and since it's used with the slowest algo, the results should be good anyway
Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz (SSE4)
28576 cycles for 100 * f8Exp (Dave, 2^0.5)
13867 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30481 cycles for 100 * pow (CRT, 4^0.5)
28425 cycles for 100 * f8Exp (Dave, 2^0.5)
13873 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27419 cycles for 100 * pow (CRT, 4^0.5)
28460 cycles for 100 * f8Exp (Dave, 2^0.5)
13872 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27390 cycles for 100 * pow (CRT, 4^0.5)
28419 cycles for 100 * f8Exp (Dave, 2^0.5)
13830 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27393 cycles for 100 * pow (CRT, 4^0.5)
28441 cycles for 100 * f8Exp (Dave, 2^0.5)
13849 cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27396 cycles for 100 * pow (CRT, 4^0.5)
1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)
--- ok ---
Quote from: rrr314159 on October 07, 2015, 12:45:57 AM
So ... wouldn't it be better to test them with the same inputs? Not that I want to make extra work for you. Only 4^0.5 ought to go faster, and since it's used with the slowest algo, the results should be good anyway
Here is one with almost identical values. The little difference is to verify that the algo actually changed the variable...
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
14648 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12378 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24845 cycles for 100 * pow (CRT, 5.4323^0.5)
14645 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12376 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24840 cycles for 100 * pow (CRT, 5.4323^0.5)
14656 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12378 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24854 cycles for 100 * pow (CRT, 5.4323^0.5)
14656 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12373 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24837 cycles for 100 * pow (CRT, 5.4323^0.5)
14642 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12374 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24848 cycles for 100 * pow (CRT, 5.4323^0.5)
5.43210000000000 for f8Exp (Dave, 5.4321^0.5)^2
5.43220000000000 for ExpXY (MasmBasic, 5.4322^0.5)^2
5.43230000000000 for pow (CRT, 5.4323^0.5)^2
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)
21041 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23269 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
32989 cycles for 100 * pow (CRT, 5.4323^0.5)
20927 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23154 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33051 cycles for 100 * pow (CRT, 5.4323^0.5)
20892 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23173 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33151 cycles for 100 * pow (CRT, 5.4323^0.5)
20886 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23342 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33086 cycles for 100 * pow (CRT, 5.4323^0.5)
21158 cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23167 cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33050 cycles for 100 * pow (CRT, 5.4323^0.5)
5.43210000000000 for f8Exp (Dave, 5.4321^0.5)^2
5.43220000000000 for ExpXY (MasmBasic, 5.4322^0.5)^2
5.43230000000000 for pow (CRT, 5.4323^0.5)^2
--- ok ---
Am I right, that pow(x,y) is the same as exp(y * log(x)) ?
For example: pow(2.0, 0.5) = 1.414213562373 as calculated in the routines from Dave and JJ2007.
But if I calculate exp(0.5 * log(2.0)), I get 1.16243273894 as a result on my calculator.
log(2.0) = 0.30102999566398119521
exp(0.5 * 0.30102999566398119521) = 1.16243273894
Below is a fast exp approximation routine (the log() value is hard-coded; it needs a fast log routine.....).
It's not exact, but it's fast:
.data
exp0 real8 0.99962789571721377560130024774
exp1 real8 0.99793872910703643074502373922
exp2 real8 0.50289865085404914825661658491
exp3 real8 0.17648623219024696305508965346
exp4 real8 0.03996291422520886755278730528
Y real8 0.5
Xlog_2 real8 0.30102999566398119521 ; log(2.0)
Result real8 0.0
.code
movsd xmm0,real8 ptr [Xlog_2]
mulsd xmm0,real8 ptr [Y]
movsd xmm1,real8 ptr [exp4]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp3]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp2]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp1]
mulsd xmm0,xmm1
addsd xmm0,real8 ptr [exp0]
movsd Result,xmm0 ; result = 1.161847999603
ret
exp(y * ln(x)) is equal to x^y (ln is log base e...). I calculated exp(0.5 * ln(2.0)) and it is equal to 1.414213562373, as expected.
Quote from: Siekmanski: Am I right, that pow(x,y) is the same as exp(y * log(x)) ?
Close! But you've mixed up log (base 10) and ln (base e).
The relevant math goes like this,
let f = x^y
ln f = y ln x ; take ln of both sides
f = e ^ (y ln x) ; take exp of both sides
So, pow(x,y) = x^y = e ^ (y ln x). (as gelatine1 says)
OR ELSE,
let f = x^y
log f = y log x ; take log of both sides
f = 10 ^ (y log x) ; take power-of-10 of both sides
So, pow(x,y) = x^y = 10 ^ (y log x)
Check it, both formulas give 1.41421...
10 was used in the old days when we looked answers up in log tables.
Your exp routine is using a set of coefficients close to the standard Taylor series, but not the same; I guess those are Chebyshev coefficients? The Taylor series, of course, is
e^a = 1 + a + a^2/2! + a^3/3! + a^4/4! + ...
So all you need to make that a full "pow" routine is the ln series, plus some manipulation to scale by powers of 2 so that the series converges. (With computers we use powers of 2, of course; in the old days it would have been powers of 10.)
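The Taylor series above can be summed directly; a minimal C sketch (my own, not from any poster's library), fine for small |a|, which is exactly what the powers-of-2 scaling guarantees:

```c
/* Sum the Taylor series e^a = 1 + a + a^2/2! + a^3/3! + ... directly.
   30 terms is far more than enough for |a| < 1. */
static double exp_taylor(double a)
{
    double term = 1.0, sum = 1.0;
    for (int n = 1; n < 30; n++) {
        term *= a / n;   /* builds a^n / n! incrementally */
        sum += term;
    }
    return sum;
}
```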
x^y = log^-1(y*log(x))
doesn't matter which base you use, so long as it's the same base for log^-1 (antilog) and log
for the FPU, it's base 2, of course
the FY2L2X instruction performs the y*log(x) part in a single step
it's the antilog that's a pain - lol
it's not that bad, really
it just looks bad because you have to work on the fraction and exponent separately
Quote from: dedndave: the FY2L2X instruction performs the y*log(x) part in a single step
- that's fyl2x (only one 2). It's the log, base 2 of course
Quote from: dedndave: it's the antilog that's a pain - lol
- f2xm1 does the antilog (or exp) - base 2 of course
Quote from: dedndave: it's not that bad, really
it just looks bad
- ? Don't get it. What looks bad about it? The hardest part is remembering the names of the instructions! I had to check to make sure there's only one "2" in fyl2x :biggrin:
- When computed with a power series, both functions require scaling: by 2 in the computer case, by 10 in the good old days. The reason is that the respective power series must converge. No doubt the new instructions use series internally, and f2xm1 requires input less than 1. However, fyl2x does not; apparently they do the scaling internally for us, which is nice. (Oddly, fyl2xp1 does limit the input X - according to the masm32 FPU tute, anyway.)
- So in that sense you're right, f2xm1 looks more "bad". But for speed we can't use these instructions anyway (AFAIK), so both operations require scaling. That amounts to using the mantissa in the series and adjusting the exponent by shifting by powers of 2. That's what fscale does - but again, according to Agner, for speed it should be avoided also
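Agner's exponent-bit trick mentioned above can be sketched in C (assumptions: IEEE-754 double, and the result stays normal - this is what fscale does, minus all the special cases):

```c
#include <stdint.h>
#include <string.h>

/* Scale a double by 2^n by adding n directly to the IEEE-754 exponent
   field instead of calling fscale/ldexp. No overflow, underflow or
   denormal handling - the result must stay a normal number. */
static double scale_by_pow2(double x, int n)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += (uint64_t)(int64_t)n << 52;  /* exponent field starts at bit 52 */
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```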
@gelatine1,
Yes, I've mixed up log (base 10) and ln (base e).
It works with a natural logarithm.
ln(2.0) = 0.69314718056
Next step is a fast natural logarithm routine...
@rrr314159,
Those are Chebyshev coefficients,
http://lolengine.net/wiki/doc/maths/remez/tutorial-relative-error
Quote from: Siekmanski on October 07, 2015, 03:24:06 PM
Those are Chebyshev coefficients,
http://lolengine.net/wiki/doc/maths/remez/tutorial-relative-error
Quote:
double fastexp2(double x)
{
const double a0 = 9.996278957172137756013002477405568253404e-1;
const double a1 = 9.979387291070364307450237392251445288594e-1;
const double a2 = 5.028986508540491482566165849167001609232e-1;
const double a3 = 1.764862321902469630550896534647276819692e-1;
const double a4 = 3.996291422520886755278730528311966596433e-2;
return a0 + x * (a1 + x * (a2 + x * (a3 + x * a4)));
}
Somebody should enlighten these guys about the physical limits of a double aka REAL8 :P
The logic of the Fast Sinus() and Cosinus() algos (http://masm32.com/board/index.php?topic=4580.0) could be used, but the problem is the range. For Sinus & Cosinus, periodicity allows limiting the lookup table, but logs have no limits in principle. Of course, if there were a concrete application with a known range, one could cheat... The potential gain is high; the logs are even slower than fsin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
8479 cycles for 100 * fsin
12385 cycles for 100 * lognat
1609 cycles for 100 * fast Sinus
Still looking for a faster natural log routine... :biggrin:
.data
X real8 2.0
Y real8 0.5
log_2 real8 0.69314718055994530941 ; ln(2.0)
exp0 real8 0.99962789571721377560
exp1 real8 0.99793872910703643074
exp2 real8 0.50289865085404914825
exp3 real8 0.17648623219024696305
exp4 real8 0.03996291422520886755
.code
;pow(X,Y) == exp(Y * ln(X))
fld real8 ptr [log_2]
fld real8 ptr [X]
fyl2x ; slow....
fstp real8 ptr [esp] ; ln(X)
movsd xmm0,real8 ptr [esp]
mulsd xmm0,real8 ptr [Y]
movsd xmm1,real8 ptr [exp4]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp3]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp2]
mulsd xmm1,xmm0
addsd xmm1,real8 ptr [exp1]
mulsd xmm0,xmm1
addsd xmm0,real8 ptr [exp0]
;result in xmm0
This is the basic idea, very simple Taylor series:
let z = x - 1, then
ln x = ln(1+z) = z - z^2/2 + z^3/3 - z^4/4 + z^5/5 - z^6/6 ...
It converges pretty well. Chebyshev-type coefficients are no doubt a bit better, but this is good enough.
Of course abs val z must be < 1, so use the mantissa in this formula. Then you just add the exponent.
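That mantissa/exponent plan can be sketched in C (ln_series is a hypothetical name, plain Taylor with no Chebyshev tuning); frexp does the mantissa/exponent split that the assembler routine would do by hand:

```c
#include <math.h>

/* ln via the series above: frexp splits x into mantissa m in [0.5, 1)
   and exponent e, so ln(x) = ln(m) + e*ln(2). The series is applied to
   z = m - 1, which lies safely in [-0.5, 0). Assumes x > 0. */
static double ln_series(double x)
{
    int e;
    double m = frexp(x, &e);        /* x = m * 2^e, 0.5 <= m < 1 */
    double z = m - 1.0, zn = z, sum = 0.0;
    for (int n = 1; n <= 40; n++) { /* z - z^2/2 + z^3/3 - z^4/4 ... */
        sum += (n & 1 ? zn : -zn) / n;
        zn *= z;
    }
    return sum + e * 0.69314718055994530942;  /* add e * ln(2) */
}
```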
These days I'm trying to take advantage of the last few days of autumn; soon I'll be snow-bound and will write a ln routine, assuming it's still wanted. Maybe I'll get around to it pretty quickly
I also have other things to do first, but later I'll try the Chebyshev polynomial algorithm to create a natural log routine.
Quote from: rrr: It converges pretty well
- You know, actually the ln series won't converge too well. Power series convergence depends on the size of the coefficients - you want them small - but ln has rather large coefficients, namely 1, -1/2, 1/3, -1/4, 1/5, -1/6, etc. Compare to exp, where the coefficients are inverse factorials, so instead of 1/4 (0.25) you have 1/24 (0.041666...). Sin and cos are really good, because every 2nd coefficient is 0. So don't expect it to be as easy as it was with them. Chebyshev - same story; they're merely somewhat more accurate adjustments of Taylor. The only good thing about ln is the alternating signs (true also for sin and cos)
Quote from: jj: the logs are even slower than fsin
- The above explains why ...
Quote from: jj: problem is the range ... logs have no limits in principle
- That's not a problem, actually. The log (ln) series inputs must be limited anyway, for the series to even dream of converging.
Quote from: rrr: Of course abs val z must be < 1, so use the mantissa in this formula. Then you just add the exponent.
- Remember log tables, in the good old days? For a number like, say, 12345, you look up the log of 1.2345 (the mantissa). Then you add 4 to it, because there are 4 powers of 10. With computers the same idea applies, but powers of 2 are used. It's very easy because the log of 10^n is simply n (the exponent).
Quote from: jj: Somebody should enlighten these guys about the physical limits of a double aka REAL8 :P
- The rubes are very impressed when they see
const double a2 = 5.028986508540491482566165849167001609232e-1; //!
- It's even got a semicolon at the end - wow, this guy knows what he's doing! Compare:
a2 real8 5.0289865085404914
- I'm falling asleep just looking at it. BTW, in this case it should end in 5: round up because of the next digit, 8. Actually it's more complicated; for that last digit you have to express the value to the closest precision as a binary number, so it may be best to use 4 (? not sure about that)
- Other than PR, there are a couple of excuses for using all the digits. Someday you may want to change it to REAL10 or even larger; might as well have them all there for you. Think of the unused digits as a comment. Also ... I once had the impression the compiler actually looked one digit farther and took care of rounding; that was a long time ago, with a different compiler (C), but I've retained the habit of keeping at least one extra digit
Quote from: Siekmanski: I also have other things to do first, but later I'll try the Chebyshev polynomial algorithm to create a natural log routine.
- Me too ... there's another problem with the ln routine (and pow(X,Y)): neither of us needs it! We both had a real need for sin/cos in our graphics work. In fact only jj2007 really wants faster pow, for MasmBasic. Maybe mabdelouahab too, but he hasn't mentioned a need for speed. Of course I'm happy to help make MasmBasic as fast as possible, and I'm sure you are too (no, I'm not being sarcastic), plus there's the challenge of the thing; but it is, no doubt, easier to put off than the trig functions. My MathArt, which you may remember was posted in Workshop, was almost twice as fast (with the "sin smoothing" option) due to our trig efforts.
Finally it's worth mentioning - and this may be the really fun part - there may be faster ways to tackle pow. The obvious thing is using the standard algo with a fast exp and a fast ln; but it's possible to glom them together into one (2-valued) power series, if all you want is the final result (pow). May or may not help.
There may be alternatives to the standard power series also. You know, for exp (ln also, but I don't remember those) there are series that converge much faster; but they're not power series, and implementing them (on a computer) is typically so much messier that they're not worth the trouble. Anyway (note to self:) browse the net a bit looking for tips like this. Find out how Mathematica, MatLab etc. do it. When you get a round Tuit ...
Quote from: rrr314159 on October 08, 2015, 08:34:51 AM: In fact only jj2007 really wants faster pow, for MasmBasic
:biggrin:
But I have a couple of other things to solve, and ExpXY is fast enough already. Plus, the gliding SetPoly3 (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1124) thing won't work as smoothly as on the Sinus().
Quote from: jj2007: ExpXY is fast enough
- there's no such thing as "fast enough" 8)
Quote from: rrr314159 on October 08, 2015, 08:34:51 AM:
- The rubes are very impressed when they see
const double a2 = 5.028986508540491482566165849167001609232e-1; //!
- It's even got a semicolon at the end - wow this guy knows what he's doing! Compare
a2 real8 5.0289865085404914
Reminds me of this nice example found in a book:
We convert from base 10 to base 2 with infinite precision, then round the result to the nearest (half to even) REAL4-precision (24-bit) value:
x1 = 1 + 2^-24 = 1.000000059604644775390625 (this number is the midpoint between 1 and its floating-point successor 1 + 2^-23)
x2 = 1 + 2^-24 + 10^-60 = 1.000000059604644775390625000000000000000000000000000000000001
A correct conversion routine* should return 1 (03F800000r) for x1, and the successor of 1 for x2 (= 1 + 2^-23 = 03F800001r)
* e.g. as described in the ISO C99 or IEEE 754-2008 standard
regards
So ISO C99 says that at the midpoint you round to even - in this case 3F800000 (down), not 3F800001. If it's above, no matter how little (10^-60, or 10^-60000), round up. So a compiler should take account of all the digits in your data statement (e.g. 5.028986508540491482566165849167001609232e-1, or 1.000000059604644775390625000000000000000000000000000000000001) to provide the correct rounding, even though of course they don't appear in the actual value your program will use. That's what I had thought. I wonder if ml.exe and/or JWasm does that. Easy enough to check it out ...
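The CRT side of it is easy to check in C (assuming a correctly rounding strtof, as e.g. glibc provides; float_bits is just an illustrative helper):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* strtof on a conforming C99 / IEEE-754 implementation converts decimal
   strings with correct rounding, however many digits follow.
   float_bits returns the raw REAL4 bit pattern of the result. */
static uint32_t float_bits(const char *s)
{
    float f = strtof(s, NULL);
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
/* The exact midpoint 1 + 2^-24 should round to even, i.e. to 1.0f
   (0x3F800000); a hair above it should round up to 0x3F800001. */
```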
Sorry, saying that C99 describes such a conversion was not right. For FP literals, the standard says the conversion should be identical to the runtime conversion (with the default runtime rounding mode). The number of significant digits is limited (DECIMAL_DIG in float.h, e.g. 21 for gcc). The standard then says:
Quote: If the subject sequence has the decimal form and at most DECIMAL_DIG significant digits, the result should be correctly rounded. If the subject sequence D has the decimal form and more than DECIMAL_DIG significant digits, consider the two bounding, adjacent decimal strings L and U, both having DECIMAL_DIG significant digits, such that the values of L, D, and U satisfy L ≤ D ≤ U.
The result should be one of the (equal or adjacent) values that would be obtained by correctly rounding L and U according to the current rounding direction, with the extra stipulation that the error with respect to D should have a correct sign for the current rounding direction
That is to say, the IEEE standard allows - but does not require - a limitation of the number of significant digits.