Trigonometry ...

Started by rrr314159, March 26, 2015, 04:48:39 PM


rrr314159

Hello Raistlin,

Good to hear you're interested; I get as much satisfaction from sharing my efforts as you do from reading them, so it's an even trade.

The questions are sensible, but others have more experience than I do; here's my take on them:

1) I wish we could control the cache programmatically, or at least a portion of it, but we can't; the CPU follows its own logic. The way I use a LUT, which is probably the same as you and others, it's called within a stream of instructions that access other portions of memory, such as video. Those other areas often don't belong in the cache because they won't be used again, so I would rather retain the LUT; instead it usually gets kicked out, and when the instruction sequence gets back to the LUT it often has to be reloaded. Now, I could make those other memory accesses non-temporal to avoid this, but that doesn't necessarily work and is a lot of trouble, so I don't. I'm typically accessing sin/cos as much as 250 million times per second; it would be great if I could do all of those together, but that's also out of the question. A 16384-entry LUT doesn't fit in my L1 cache anyway, although it is probably retained in the L3 cache (6 MB on my machine). There are a lot of issues. The answer is "yes, there are methods by which we can maximize LUT cache usage", but nothing clear or simple, so I don't really try. That's one reason I like the power-series approach instead. Someone else may have better insight.
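
For what it's worth, here is a minimal C sketch of that non-temporal idea, just to make it concrete: the sine values come from the LUT (and stay cache-friendly), while the results go to the output buffer with streaming stores that bypass the cache, so they don't evict LUT lines. The names (sin_lut, phase, sines_to_buffer, dest) are invented for the example; only the SSE intrinsics _mm_stream_ps and _mm_sfence are standard.

#include <math.h>
#include <xmmintrin.h>                   /* SSE: __m128, _mm_setr_ps, _mm_stream_ps, _mm_sfence */

#define LUT_SIZE 16384                   /* one full period, power of two */
static float sin_lut[LUT_SIZE];

void build_lut(void)
{
    for (int i = 0; i < LUT_SIZE; i++)
        sin_lut[i] = (float)sin(2.0 * 3.14159265358979323846 * i / LUT_SIZE);
}

/* dest must be 16-byte aligned and count a multiple of 4 */
void sines_to_buffer(const unsigned *phase, float *dest, int count)
{
    for (int i = 0; i < count; i += 4) {
        __m128 v = _mm_setr_ps(sin_lut[phase[i]     & (LUT_SIZE - 1)],
                               sin_lut[phase[i + 1] & (LUT_SIZE - 1)],
                               sin_lut[phase[i + 2] & (LUT_SIZE - 1)],
                               sin_lut[phase[i + 3] & (LUT_SIZE - 1)]);
        _mm_stream_ps(dest + i, v);      /* non-temporal store: skips the cache */
    }
    _mm_sfence();                        /* order the streaming stores before the data is reused */
}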

2) Engineering/scientific precision requirements vary, of course, depending on many factors, but I would say they usually need more precision than the 7 or so digits I've been concentrating on (I'm doing a graphics app). Scientific work wants, I would say, 9 digits, and more is better. Again a LUT is not too satisfactory, since you need a very large table; Chebyshev power series are very good in this case. Siekmanski posted a set of 5 coefficients for 9 digits; the exact same algos and techniques we've been posting can be made as precise as desired, just find larger sets of 6, 7, ... coefficients. If you don't know how to put them into the algo I'll be happy to do it; it's very straightforward (there's a small sketch of the pattern below). Gunther works with CERN, so his opinion would be valued on this question; I haven't done physics software in many years.
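
To show why I call it straightforward, here is an illustrative C version of the Horner pattern those coefficient sets drop into. sin_poly is an invented name and no coefficient values are invented here: with the four fastsin2 coefficients quoted later in this thread it reproduces that routine, and a published 5- or 6-term set goes in with no other change.

/* sin(x) ~ x + x^3*(c[0] + x^2*(c[1] + x^2*(...))) for |x| <= pi/2;
   pass 4, 5, 6, ... coefficients of an odd minimax/Chebyshev set for more digits */
double sin_poly(double x, const double *c, int n)
{
    double x2 = x * x;
    double p  = c[n - 1];                /* highest-order coefficient first */
    for (int i = n - 2; i >= 0; i--)
        p = p * x2 + c[i];               /* one Horner step per extra coefficient */
    return x + x * x2 * p;
}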

3) SSE and AVX do vector operations: parallel computation of 4 or 8 sin/cos at once. I run these algos on all 4 cores of my machines (Intel i5 and AMD A6), and it works great, with a very large favorable impact! In the old days I left my computer running overnight to accomplish what today takes 1 second! The FPU also works fine on multiple cores, but it's not vector, only one value at a time (although it does give the maximum possible precision). Not sure if that answers your question ..?
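
As an aside, the multi-core part needs nothing fancy, because the polynomial sine is pure arithmetic with no shared state. A rough C sketch of the idea (fill_sines and the buffer names are invented; it assumes the angles are already range-reduced and an OpenMP-capable compiler):

#include <omp.h>

double fastsin2(double x);               /* e.g. the 4-coefficient routine quoted further down */

/* angle[] already reduced to roughly -pi/2 .. +pi/2 */
void fill_sines(const double *angle, double *out, int n)
{
    #pragma omp parallel for             /* each core takes a chunk of the buffer */
    for (int i = 0; i < n; i++)
        out[i] = fastsin2(angle[i]);
}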

Thanks again for your interest,
I am NaN ;)

jj2007

Quote from: Siekmanski on April 15, 2015, 04:36:41 PM
And this one is our goldmine, I think:
http://lolengine.net/wiki/doc/maths/remez

That guy has C++ source code to calculate Sin, Cos, Tan and Exp coeffs at extreme precision.
Those are of interest to me (for my mathlib); unfortunately I don't have a C++ compiler to make use of it...
Maybe you can calculate the coeffs for us?

This is where I found the coeffs for the 9th-order polynomial.

There is even a faster one with 4 coeffs:


double fastsin2(double x)
{
    const double a3 = -1.666665709650470145824129400050267289858e-1;
    const double a5 = 8.333017291562218127986291618761571373087e-3;
    const double a7 = -1.980661520135080504411629636078917643846e-4;
    const double a9 = 2.600054767890361277123254766503271638682e-6;

    return x + x*x*x * (a3 + x*x * (a5 + x*x * (a7 + x*x * a9)));
}
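
Note that fastsin2 is only good on roughly -pi/2 .. +pi/2, so for arbitrary angles it needs the same range reduction the SSE procs further down perform: divide by pi, round to the nearest half-turn, flip the sign on odd half-turns. A scalar C sketch of that wrapper (sin_any is an invented name):

#include <math.h>                                /* nearbyint */

double fastsin2(double x);                       /* the 4-coefficient routine above */

double sin_any(double x)
{
    double t = x * 0.31830988618379067154;       /* x / pi                       */
    double k = nearbyint(t);                     /* nearest whole half-turn      */
    double r = (t - k) * 3.14159265358979323846; /* reduced angle, |r| <= pi/2   */
    double s = fastsin2(r);
    return ((long long)k & 1) ? -s : s;          /* odd half-turn: flip the sign */
}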

Hi Marinus,
Here is a test with the fast algo:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
3313 ms for sin
341 ms for sin2

3175 ms for sin
351 ms for sin2

3207 ms for sin
349 ms for sin2

3157 ms for sin
347 ms for sin2


Roughly a factor of 9 over the FPU sine, not bad :t

HSE

AMD A6-3500 APU with Radeon(tm) HD Graphics
3209 ms for sin
864 ms for sin2

3251 ms for sin
857 ms for sin2

3238 ms for sin
870 ms for sin2

3145 ms for sin
853 ms for sin2


Very interesting, but here only a factor of 4.
Equations in Assembly: SmplMath

Siekmanski

Same algo, 4 single precision values at once  :biggrin:

.data
align 16

OneDivPi    real4 4 dup (0.31830988618379067153776752674)
Pi          real4 4 dup (3.14159265358979323846264338327)
ChebyChf40  real4 4 dup (-0.16666657096504701458241294000)
ChebyChf41  real4 4 dup (0.00833301729156221812798629161)
ChebyChf42  real4 4 dup (-0.00019806615201350805044116296)
ChebyChf43  real4 4 dup (0.00000260005476789036127712325)

.code

align 16
SSE2sinSPf proc                             ; Calculate Sine of 1 single or 2,3,4 packed single floating point values ( in: xmm0, result: xmm0 )

    mulps       xmm0,oword ptr OneDivPi     ; 1/pi to get a 1pi range
    cvtps2dq    xmm3,xmm0                   ; (4 packed spfp to 4 packed int32) lose the fractional parts and keep it in xmm3 to save the signs
    cvtdq2ps    xmm1,xmm3                   ; (4 packed int32 to 4 packed spfp) save the  integral parts
    subps       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 and +0.4999999
    pslld       xmm3,31                     ; put sign-bits in position, to place values in the right hemispheres
    xorps       xmm0,xmm3                   ; set sign-bits
    mulps       xmm0,oword ptr Pi           ; restore ranges between -1/2 pi to +1/2 pi

; Now do the Chebyshev approximation of a 9th degree polynomial
; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi

    movaps      xmm2,xmm0                   ; xmm2 = x
    mulps       xmm2,xmm2                   ; xmm2 = x^2

    movaps      xmm1,oword ptr ChebyChf43   ; start the Horner chain with the highest coefficient (a9)
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr ChebyChf42   ; a7 + a9*x^2
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr ChebyChf41   ; a5 + x^2*(a7 + a9*x^2)
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr ChebyChf40   ; a3 + x^2*(a5 + x^2*(a7 + a9*x^2))
    mulps       xmm2,xmm0                   ; xmm2 = x^3
    mulps       xmm1,xmm2                   ; x^3 * polynomial
    addps       xmm0,xmm1                   ; sin(x) ~ x + x^3 * polynomial
    ret

SSE2sinSPf endp


And 2 double precision values at once:
 
.data

align 16
OneDivPiDP  real8 2 dup (0.31830988618379067153776752674)
PiDP        real8 2 dup (3.14159265358979323846264338327)

ChebyChf0   real8 2 dup (-0.16666657096504701458241294000)
ChebyChf1   real8 2 dup (0.00833301729156221812798629161)
ChebyChf2   real8 2 dup (-0.00019806615201350805044116296)
ChebyChf3   real8 2 dup (0.00000260005476789036127712325)

.code

align 16
SSE2sinDPf proc                             ; Calculate Sine of 1 double or 2 packed double floating point values ( in: xmm0, result: xmm0 )
     
    mulpd       xmm0,oword ptr OneDivPiDP   ; 1/pi to get a 1pi range
    cvtpd2dq    xmm3,xmm0                   ; (2 packed dpfp to 2 int32) lose the fractional parts and keep it in xmm3 to save the signs
    cvtdq2pd    xmm1,xmm3                   ; (2 packed int32 to 2 dpfp) save the  integral parts
    subpd       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 and +0.4999999
    pslld       xmm3,31                     ; put sign-bits in position, to place values in the right hemispheres
    pshufd      xmm3,xmm3,Shuffle(1,3,0,2)  ; move the sign bits from dwords 0,1 into dwords 1,3 (the high dword of each double)
    xorpd       xmm0,xmm3                   ; set sign-bits
    mulpd       xmm0,oword ptr PiDP         ; restore ranges between -1/2 pi to +1/2 pi

; Now do the Chebyshev approximation of a 9th degree polynomial
; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi

    movapd      xmm2,xmm0
    mulpd       xmm2,xmm2

    movapd      xmm1,oword ptr ChebyChf3
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyChf2
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyChf1
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyChf0
    mulpd       xmm2,xmm0
    mulpd       xmm1,xmm2
    addpd       xmm0,xmm1
    ret
SSE2sinDPf endp
Creative coders use backward thinking techniques as a strategy.