Trigonometry ...

rrr314159 · April 17, 2015, 05:05:38 PM

Hello Raistlin,

good to hear you're interested, I get as much satisfaction from sharing my efforts as u do from reading them so it's an even trade

Questions are sensible but others have more experience than me, here's my take on them,

1) I wish we could control cache programatically, or at least a portion of it, but can't; CPU follows its own logic. The way I use a LUT, which is probably same as you and others, it's called within a stream of commands which access other portions of memory such as Video. Those other areas often don't belong in cache because won't be used again, so I would rather retain the LUT; instead it will usually be kicked out. So when the instruction sequence gets back to LUT it often will have to be reloaded. Now, I could make those other memory accesses non-temporal to avoid this but that doesn't necessarily work and is a lot of trouble so I don't. I'm typically accessing sin/cos as much as 250 million per second, would be great if I could do all those together but that's also out of the question. 16384 LUT doesn't fit in my L1 cache anyway, altho it is probably retained in L3 cache (6 MB on my machine). There are a lot of issues - answer is "Yes there are methods by which we can maximize LUT cache usage" but nothing clear or simple, so I don't really try. That's one reason I like the Power Series approach instead. Someone else may have better insight.

2) Engineering / Scientific precision req'ts vary of course depending on many factors but I would say, usually they need more precision than the 7 or so digits I've been concentrating on (I'm doing a graphics app). Scientific wants, I would say, 9; and more is better. Again LUT not 2 satisfactory, need very large table: Chebyshev Power Series are very good in this case. siekmanski posted a set of 5 coeff's for 9 digits; the exact same algos and same techniques we've been posting can be as precise as desired, just find larger sets of 6, 7 .. coeff's. If u don't know how to put them in algo I'll be happy to do it, very straightforward. Gunther works with CERN so his opinion would be valued on this question, I haven't done physics software in many years.

3) SSE and AVX do vector operations, parallel comp of 4 or 8 sin/cos at once. I run these algos on all 4 cores of my machines (Intel i5 and AMD A6), works great - has very large favorable impact! In the old days I left my computer running overnight to accomplish what today takes 1 second! FPU also works fine on multiple cores but it's not vector, only one at a time (altho it does give max possible precision). Not sure if that answers your q .. ?

Thanks again for your interest,

jj2007 · September 04, 2015, 07:22:25 AM

Quote from: Siekmanski on April 15, 2015, 04:36:41 PMAnd this one is our goldmine i think,
http://lolengine.net/wiki/doc/maths/remez

That guy has C++ source code to calculate Sin, Cos, Tan and Exp coeffs at extreme precision.
Those are of my interest ( for my mathlib ), unfortunatly i don't have a C++ compiler to make use of it.....
Maybe you can calculate the coeffs for us ?

This is where i found the coeffs for the 9th order polynomial.

There is even a faster one with 4 coeffs,

Code Select Expand
double fastsin2(double x) { const double a3 = -1.666665709650470145824129400050267289858e-1; const double a5 = 8.333017291562218127986291618761571373087e-3; const double a7 = -1.980661520135080504411629636078917643846e-4; const double a9 = 2.600054767890361277123254766503271638682e-6; return x + x*x*x * (a3 + x*x * (a5 + x*x * (a7 + x*x * a9)))); }

Hi Marinus,
Here is a test with the fast algo:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
3313 ms for sin
341 ms for sin2

3175 ms for sin
351 ms for sin2

3207 ms for sin
349 ms for sin2

3157 ms for sin
347 ms for sin2

Roughly a factor 9 on the FPU sinus, not bad :t

HSE · September 04, 2015, 12:05:41 PM

Code Select

AMD A6-3500 APU with Radeon(tm) HD Graphics
3209 ms for sin
864 ms for sin2

3251 ms for sin
857 ms for sin2

3238 ms for sin
870 ms for sin2

3145 ms for sin
853 ms for sin2

Very interesting, but here only a factor 4.

Siekmanski · September 11, 2015, 10:17:53 PM

Same algo, 4 single precision values at once

Code Select

.data 
align 16 

OneDivPi    real4 4 dup (0.31830988618379067153776752674) 
Pi          real4 4 dup (3.14159265358979323846264338327) 
ChebyChf40  real4 4 dup (-0.16666657096504701458241294000) 
ChebyChf41  real4 4 dup (0.00833301729156221812798629161) 
ChebyChf42  real4 4 dup (-0.00019806615201350805044116296) 
ChebyChf43  real4 4 dup (0.00000260005476789036127712325) 

.code 

align 16 
SSE2sinSPf proc                             ; Calculate Sine of 1 single or 2,3,4 packed single floating point values ( in: xmm0, result: xmm0 ) 

    mulps       xmm0,oword ptr OneDivPi     ; 1/pi to get a 1pi range 
    cvtps2dq    xmm3,xmm0                   ; (4 packed spfp to 4 packed int32) lose the fractional parts and keep it in xmm3 to save the signs 
    cvtdq2ps    xmm1,xmm3                   ; (4 packed int32 to 4 packed spfp) save the  integral parts 
    subps       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 to 0.4999999 
    pslld       xmm3,31                     ; put sign-bits in position, to place values in the right hemispheres 
    xorps       xmm0,xmm3                   ; set sign-bits 
    mulps       xmm0,oword ptr Pi           ; restore ranges between -1/2 pi to +1/2 pi 

; Now do the Chebyshev approximation of a 9th degree polynomial 
; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi 

    movaps      xmm2,xmm0 
    mulps       xmm2,xmm2 

    movaps      xmm1,oword ptr ChebyChf43 
    mulps       xmm1,xmm2 
    addps       xmm1,oword ptr ChebyChf42 
    mulps       xmm1,xmm2 
    addps       xmm1,oword ptr ChebyChf41 
    mulps       xmm1,xmm2 
    addps       xmm1,oword ptr ChebyChf40 
    mulps       xmm2,xmm0 
    mulps       xmm1,xmm2 
    addps       xmm0,xmm1 
    ret 

SSE2sinSPf endp

And 2 double precision values at once:

Code Select

.data 

align 16 
OneDivPiDP  real8 2 dup (0.31830988618379067153776752674) 
PiDP        real8 2 dup (3.14159265358979323846264338327) 

ChebyChf0   real8 2 dup (-0.16666657096504701458241294000) 
ChebyChf1   real8 2 dup (0.00833301729156221812798629161) 
ChebyChf2   real8 2 dup (-0.00019806615201350805044116296) 
ChebyChf3   real8 2 dup (0.00000260005476789036127712325) 

.code 

align 16 
SSE2sinDPf proc                             ; Calculate Sine of 1 double or 2 packed double floating point values ( in: xmm0, result: xmm0 ) 
     
    mulpd       xmm0,oword ptr OneDivPiDP   ; 1/pi to get a 1pi range 
    cvtpd2dq    xmm3,xmm0                   ; (2 packed dpfp to 2 int32) lose the fractional parts and keep it in xmm3 to save the signs 
    cvtdq2pd    xmm1,xmm3                   ; (2 packed int32 to 2 dpfp) save the  integral parts 
    subpd       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 to 0.4999999 
    pslld       xmm3,31                     ; put sign-bits in position, to place values in the right hemispheres 
    pshufd      xmm3,xmm3,Shuffle(1,3,0,2)  ; shuffle the sign-bits into place          
    xorpd       xmm0,xmm3                   ; set sign-bits 
    mulpd       xmm0,oword ptr PiDP         ; restore ranges between -1/2 pi to +1/2 pi 

; Now do the Chebyshev approximation of a 9th degree polynomial 
; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi 

    movapd      xmm2,xmm0 
    mulpd       xmm2,xmm2 

    movapd      xmm1,oword ptr ChebyChf3 
    mulpd       xmm1,xmm2 
    addpd       xmm1,oword ptr ChebyChf2 
    mulpd       xmm1,xmm2 
    addpd       xmm1,oword ptr ChebyChf1 
    mulpd       xmm1,xmm2 
    addpd       xmm1,oword ptr ChebyChf0 
    mulpd       xmm2,xmm0 
    mulpd       xmm1,xmm2 
    addpd       xmm0,xmm1 
    ret 
SSE2sinDPf endp

The MASM Forum

News:

Trigonometry ...

rrr314159

jj2007

HSE

Siekmanski