News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Trigonometry ...

Started by rrr314159, March 26, 2015, 04:48:39 PM

Previous topic - Next topic

rrr314159

thx Gunther, siekmanski, TWell, nidud, dedndave,

it's encouraging to see that the Blank Loop totals are all close, indicating stability. They dropped between the 2 versions of the program as I cleaned it up and eliminated some no-longer used portions. Blank Loop is mostly non-SSE instructions. This indicates to me that the calculation numbers pretty accurately reflect how the machines handle SSE instructions; obviously AMD, and older, machines are not as good (altho AMD is still champ for fsin,  fcos and "ftan"). BTW my Intel numbers beat siekmanski's, altho his machine is actually better; that's only because I selected the best run out of 20 or 30, a real outlier. My normal runs were behind his by a factor approx. 136 to 117. Note this is independent of clock speed; but normally the higher the clock speed, the fewer the cycles, because both are characteristic of more modern CPU's. Be interesting to do a statistical analysis ... but it's a beautiful day, Spring has sprung, and I'm outta here!

Thanks again,
I am NaN ;)

dedndave

i noticed that the test loops execute VERY quickly
if you want nice stable timings, increase the loop counts so that each test takes at least 0.5 seconds   :t
the loop count does not need to be the same for all the tests
i.e., faster code => larger loop count

Siekmanski

#92
I like your macros, nice style.  :t

Quote from: rrr314159 on April 06, 2015, 03:02:05 AM
BTW my Intel numbers beat siekmanski's, altho his machine is actually better; that's only because I selected the best run out of 20 or 30, a real outlier. My normal runs were behind his by a factor approx. 136 to 117.

Don't get fooled by my machine. I tuned it down to lower the fan-speeds so, it becomes a very quiet but still fast machine.
This is because i use it sometimes for sound recordings.  :biggrin:

EDIT:
Quote
I also did find some speed improvements in my routines ( put in some linear algebra )

Thought i replace,
        mov   edx,16384*2
        sub    edx,eax
with: xor    eax,-1
And gain some speed, was i wrong here.

xor    eax,-1        0.86 cycles

mov   edx,16384*2
sub    edx,eax     together 0.53 cycles

Wasted 1 hour programming....  :(

Added a sincos routine. When i'm done testing i'll post them.
BTW i'm still curious for your faster fpu routine you mentioned a few posts before.

Creative coders use backward thinking techniques as a strategy.

rrr314159

#93
Quote from: Siekmanski on April 06, 2015, 06:14:49 AMI like your macros, nice style.

- thanks ... BTW nidud mentioned he preferred proc's. When I want a proc for small math algos like this (for instance making a library) I write the macro and create proc's from it e.g.

sin_high_precision_single proc   ; assuming inputs are in xmm0 (or u could pass them to proc)
    _USEDOUBLE = 0
    MORE_PRECISION = 1
    trigSSE 0
ret
sin_high_precision_single endp

cos_low_precision_double proc   ; assuming inputs are in xmm0
   _USEDOUBLE = 1
   MORE_PRECISION = 0
   trigSSE 1
ret
cos_low_precision_double endp

...etc You can also pass flags to the proc (highprecisionflag:BOOL etc) and use multiple invocations within it. This way one macro is used to make all the procs you need. For maintenance just one macro needs to be fixed / changed, and recompile the proc's. Of course there are many cases where such a technique is meaningless, and I just write proc in the normal way.

What do u think of such an approach - in cases where applicable, which often happens with these small math algos - is this commonly done?

Quote from: Siekmanski on April 06, 2015, 06:14:49 AM...i use it sometimes for sound recordings.  :biggrin:

- Not that it's any of my business but if u feel like mentioning, what do u record? I'm pseudo-professional bass player myself

Quote from: Siekmanski on April 06, 2015, 06:14:49 AMThought i replace,
        mov   edx,16384*2
        sub    edx,eax
with: xor    eax,-1
And gain some speed, was i wrong here.

xor    eax,-1        0.86 cycles

mov   edx,16384*2
sub    edx,eax     together 0.53 cycles

- Actually "neg eax" is correct, not xor, altho depending how u use the LUT (0..16383 or 1..16384) and rounding mode xor might give the right result anyway. But I wonder if neg is faster?

Quote from: Siekmanski on April 06, 2015, 06:14:49 AMWasted 1 hour programming....  :(

- NOT wasted (IMHO). Had this argument often working in R&D - we'd spend 6 months on some possible approach, wind up proving it was no good, drove budget people crazy. They figured we could have just done nothing, same result, saved half a million bucks! But then we wouldn't know the approach didn't work. That's R&D for you.


Quote from: Siekmanski on April 06, 2015, 06:14:49 AMBTW i'm still curious for your faster fpu routine you mentioned a few posts before.

- Sure - The routine is more-or-less done, but I couldn't face updating test_support_macros to try it out! That's the hard part. If u don't mind testing it yourself, here it is. From your POV, the best part is it uses normal default rounding mode. Could be a made a little faster I guess ... let me know if it works ;)

;;*******************************
;; rrr314159 2015/03/30 fpu-based Power Series for Sin and Cos
;;******************************* use default round-to-nearest
MORE_PRECISION = 1              ;; set to 0 for faster
;;*******************************
trigFPU MACRO sincosflag:=<0>   ;; input and output in st
;;*******************************
    IFNDEF cheby_0
        .const                 
            align 16
            pi real8 3.1415926535897931 ;; generic constants
            oneoverpi real8 0.31830988618379067
            piovertwo real8 1.5707963267948966
           
            cheby_0 real8 0.99999660  ;; 4 coeffs approx 10^-7
            cheby_1 real8 -0.16664824
            cheby_2 real8 0.00830629
            cheby_3 real8 -0.00018363
       
            ch0 real8 0.999695        ;; 3 coeffs approx 10^-5
            ch1 real8 -0.1656700
            ch2 real8 0.0075134
        .code
    ENDIF
   
    IF sincosflag
        fadd piovertwo
    ENDIF
   
    fmul oneoverpi              ;; div by pi to put in two hemispheres
    fld st
    fistp qword ptr [esp-8]
    mov eax, dword ptr [esp-8]  ;; eax has (lower word of) rounded quotient x / pi
    fild qword ptr [esp-8]
    fsub                        ;; now mod 1 (-.5 to .4999999) meaning -pi/2 to pi/2
    test eax, 1
    jz @F
        fchs                    ;; -x for left hemisphere
    @@:
    fmul pi                     ;; restore scale
    fld st
    fmul st, st                 ;; x^2, x
   
    IF MORE_PRECISION
        firstone cheby_3        ;; cheby_3*x^2, x^2, x
        addone cheby_2          ;; and so on
        addone cheby_1
        fxch
        fstp st
        addone cheby_0
    ELSE
        firstone ch2
        addone ch1
        fxch
        fstp st
        addone ch0
    ENDIF
    fxch
    fstp st                     ;; answer in st(0) all other regs free
ENDM
;--------------------------------
firstone MACRO __coeff:REQ     ;; utility MACRO for trigFPU
    fld __coeff
    fmul st, st(1)
ENDM
;--------------------------------
addone MACRO __coeff:REQ       ;; utility MACRO for trigFPU
    fadd __coeff
    fmul st, st(1)
ENDM
;;*******************************
I am NaN ;)

Siekmanski

Quote from: rrr314159 on April 06, 2015, 12:28:06 PMWhat do u think of such an approach - in cases where applicable, which often happens with these small math algos - is this commonly done?

I think so, why not.
My intention is to use the trig functions for my D3Dmath.lib so, i'll create them as proc's.
Rewrite them as macros when i need some extra speed.

Quote from: rrr314159 on April 06, 2015, 12:28:06 PM- Not that it's any of my business but if u feel like mentioning, what do u record? I'm pseudo-professional bass player myself

It's for realtime voice recording for an FM-broadcast proggy i'm writing. (and i just hate noisy computers)

Quote from: rrr314159 on April 06, 2015, 12:28:06 PM- Actually "neg eax" is correct, not xor, altho depending how u use the LUT (0..16383 or 1..16384) and rounding mode xor might give the right result anyway. But I wonder if neg is faster?


A)  mov     edx,16384*2
    sub     edx,eax
    fld     real4 ptr [SSE2SinTableSP+edx*4-4] 

B)  xor     eax,-1
    fld     real4 ptr [SSE2SinTableSP+(16384*4*2)+eax*4] 

C)  neg     eax
    fld     real4 ptr [SSE2SinTableSP+(16384*4*2)+eax*4-4]


B is slower than A. C seems the fastest on my machine.
Have to test it and pick the fasted variant for each quadrant.

Thanks for the routine, i'll test it next week, no time for programming this week.
Creative coders use backward thinking techniques as a strategy.

dedndave

not eax
inc eax

is equiv to neg eax

xor eax,-1 is the same as not eax   :biggrin:

i would think neg eax is fastest

Gunther

Dave,

Quote from: dedndave on April 07, 2015, 11:44:25 PM
not eax
inc eax

is equiv to neg eax

xor eax,-1 is the same as not eax   :biggrin:

That's true.

Quote from: dedndave on April 07, 2015, 11:44:25 PM
i would think neg eax is fastest

Assume nothing. We should meassure it.

Gunther
You have to know the facts before you can distort them.

nidud

#97
deleted

rrr314159

@dedndave - if u remember we had approx. the same issue 3 weeks ago in this thread. We saw that u could substitute neg x for "dec / and" (same as "not / inc"), and mabdelouahab determined it was faster - which is very much what one would suppose.

In the present case it's used in a large LUT so the difference of "1" can be compensated in other ways; still neg is about the same speed as "not", and accurate, so why not use it.

Quote from: GuntherAssume nothing. We should meassure it.

- mabdelouahab did measure it in that thread (almost the same instructions). While I agree with you, of course, remember there are problems with measuring - different machines and situations lead to very different results.

Quote from: nidudI will assume the size of the instruction plays a bigger role in this case...

- read above Gunther quote! ;) U should measure whether the size of the instruction plays a bigger role - and if so, just how big? Still, at some point it gets ridiculous; one spends one's whole life measuring, never gets around to doing ...

Quote from: nidudThe problem with this is that it's difficult to control the alignment in a macro where it may behave differently according to where it's used...

- I just start the macro with an alignment instruction, when appropriate. Admittedly it may behave differently anyway depending how it's used. What I do, make sure resulting program performs as expected; seems it always does.

Bottom line, I think measuring at the unit level can be overdone (not testing - that can never be overdone); at some point, just do system integration and make sure the system works right. You have to do that ultimately anyway.
I am NaN ;)

nidud

#99
deleted

Gunther

Hi nidud,

Quote from: nidud on April 08, 2015, 04:31:57 AM
That's definitely true.

On the other hand a bit irritating using days fine-tuning based on wrong assumptions, so the more accurate the test the better.

sure. But it's a short instruction sequenz to test, so RDTSC or perhaps RDTSCP should do a good job.

Gunther
You have to know the facts before you can distort them.

Farabi

Whoa, Thanks, Very cool :t jack of all trade.
http://farabidatacenter.url.ph/MySoftware/
My 3D Game Engine Demo.

Contact me at Whatsapp: 6283818314165

rrr314159

Here is a version of the trig Sin / Cos algo for AVX. It's written in what-I-call "Orthogonal" instruction set, so that one algo works for AVX or SSE, double or single precision. Each instruction is a macro which implements one of the four possibilities depending on flags.

The AVX implementation is 30 or 40 percent faster, so I guess it's worth it. Still it's so much trouble it's not clear. Without AVX2 u have to shuffle between high and low lanes a lot, using such commands as vextractf128, vinsertf128, vpunpckhdq, vpunpckldq, vshufd, and vperm2f128. You often have to perform ops on high and low lanes separately, temp-store in memory, other annoying techniques.

Please let me know if there are better ways to do macros like rhadd, rhmax, rmovs8, and especially rcvt2negmask.

Is AVX worth this trouble? Since AVX2 has only been around a couple of years it will not be in general use for a long time, so I can't count on its better integer instructions. AVX seems to have similar status to 64-bit - sounds good but one can do without it. How many of you simply don't bother with AVX?

The program trig.exe runs only higher-precision sin and cos, for all 4 Instruction Sets. I'm not bothering to show timings but the zip includes them. In the zip are

trig.asm - see control values at top for different ways to run it
test_support_macros.asm - uses Orthogonal instructions
trig_macros.asm - the actual trig algo very simple compared to supporting routines
localmacs.asm - mainly for printing SSE/AVX registers pretty messy
SSEAVX48.asm - the "Orthogonal" macros
trig.exe
dotrig.bat - "Make" batch file
trigtimings AVXSSE single double.out - timings results

Here are these macros for AVX/SSE single/double.

;; need data statements because 7fffffffffffffffh is too long a constant
;; and for temp data storage
IFNDEF absmask8
.const
    absmask8 dq 4 dup (7fffffffffffffffh)   ; NOT the sign bit, for double
    absmask4 dd 8 dup (7fffffffh)           ; NOT the sign bit, for single
    negmask8 dq 4 dup (8000000000000000h)   ; the sign bit, for double
    negmask4 dd 8 dup (80000000h)           ; the sign bit, for single
.data?
    align 16
    temp8 real8 4 dup(?)
    temp4 real4 8 dup(?)
.code
ENDIF

_DOUBLE equ 1
_SINGLE equ 0
_AVX equ 1
_SSE equ 0
ISTypeXY textequ <X>
WORD_SIZE = 4
ISType textequ <X4>

scratchrmm equ rmm3
scratchxmm equ xmm3

;;*******************************
;;Define all the equates used in an "Orthogonal" algo
;;*******************************
ChangeInstructionSet MACRO _USEDOUBLE:=<0>, _USEAVX:=<0>
;;*******************************
    IF _USEAVX
        rmm0 TEXTEQU <ymm0>
        rmm1 TEXTEQU <ymm1>
        rmm2 TEXTEQU <ymm2>
        rmm3 TEXTEQU <ymm3>
        rmm4 TEXTEQU <ymm4>
        rmm5 TEXTEQU <ymm5>
        rmm6 TEXTEQU <ymm6>
        rmm7 TEXTEQU <ymm7>
    ELSE
        rmm0 TEXTEQU <xmm0>
        rmm1 TEXTEQU <xmm1>
        rmm2 TEXTEQU <xmm2>
        rmm3 TEXTEQU <xmm3>
        rmm4 TEXTEQU <xmm4>
        rmm5 TEXTEQU <xmm5>
        rmm6 TEXTEQU <xmm6>
        rmm7 TEXTEQU <xmm7>
    ENDIF
   
    ;; "O" generally refers to xmm or ymm register
   
    IF _USEAVX
        O_SIZE = 32
        rmmword equ <ymmword>
        op equ ymmword ptr
        ISTypeXY textequ <Y>         ;; instruction set type
    ELSE
        O_SIZE = 16
        rmmword textequ <xmmword>
        op equ xmmword ptr
        ISTypeXY textequ <X>
    ENDIF
   
    WORD_SIZE = 4 * (_USEDOUBLE + 1)

    CHUNK = O_SIZE/WORD_SIZE          ;; (SSE) CHUNK = 2 double, 4 single
    WORD_SIZE$ textequ %WORD_SIZE       ;; AVX = 4 and 8
    ISType textequ ISTypeXY,WORD_SIZE$    ;; ISType is X4, X8, Y4, Y8
    rtype textequ <real>,WORD_SIZE$
    rptr textequ <real>,WORD_SIZE$,< ptr>
    tempreal textequ <temp>,WORD_SIZE$
    absmask textequ <absmask>,WORD_SIZE$
    negmask textequ <negmask>,WORD_SIZE$

ENDM
;;*******************************
;================================
rmov MACRO x1, x2
    IFIDN ISType, <X4>
        movups x1, x2
    ELSEIFIDN ISType, <X8>
        movupd x1, x2
    ELSEIFIDN ISType, <Y4>
        vmovups x1, x2
    ELSE    ;; Y8
        vmovupd x1, x2
    ENDIF
ENDM
;================================
radd MACRO x1, x2
    IFIDN ISType, <X4>
        addps x1, x2
    ELSEIFIDN ISType, <X8>
        addpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vaddps x1, x1, x2
    ELSE    ;; Y8
        vaddpd x1, x1, x2
    ENDIF
ENDM
;================================
rsub MACRO x1, x2
    IFIDN ISType, <X4>
        subps x1, x2
    ELSEIFIDN ISType, <X8>
        subpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vsubps x1, x1, x2
    ELSE    ;; Y8
        vsubpd x1, x1, x2
    ENDIF
ENDM
;================================
rmul MACRO x1, x2
    IFIDN ISType, <X4>
        mulps x1, x2
    ELSEIFIDN ISType, <X8>
        mulpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vmulps x1, x1, x2
    ELSE    ;; Y8
        vmulpd x1, x1, x2
    ENDIF
ENDM
;================================
rsqrsrcintodest MACRO x1, x2
    IFIDN ISType, <X4>
        movups x1, x2
        mulps x1, x1
    ELSEIFIDN ISType, <X8>
        movupd x1, x2
        mulpd x1, x1
    ELSEIFIDN ISType, <Y4>
        vmulps x1, x2, x2
    ELSE
        vmulpd x1, x2, x2
    ENDIF
ENDM
;================================
rmax MACRO x1, x2
    IFIDN ISType, <X4>
        maxps x1, x2
    ELSEIFIDN ISType, <X8>
        maxpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vmaxps x1, x1, x2
    ELSE    ;; Y8
        vmaxpd x1, x1, x2
    ENDIF
ENDM
;================================
rand MACRO x1, x2
    IFIDN ISType, <X4>
        andps x1, x2
    ELSEIFIDN ISType, <X8>
        andpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vandps x1, x1, x2
    ELSE    ;; Y8
        vandpd x1, x1, x2
    ENDIF
ENDM
;================================
rxor MACRO x1, x2
    IFIDN ISType, <X4>
        xorps x1, x2
    ELSEIFIDN ISType, <X8>
        xorpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vxorps x1, x1, x2
    ELSE    ;; Y8
        vxorpd x1, x1, x2
    ENDIF
ENDM
;================================
rhmax MACRO x1
    IFIDN ISType, <X4>
movhlps scratchrmm, x1
maxps x1, scratchrmm
pshufd scratchrmm, x1, 55h
maxps x1, scratchrmm
    ELSEIFIDN ISType, <X8>
pshufd scratchrmm, x1, 8dh
maxpd x1, scratchrmm
    ELSEIFIDN ISType, <Y4>
        vperm2f128 scratchrmm, x1, x1, 23h
        vmaxps x1, x1, scratchrmm
        vshufps scratchrmm, x1, x1, 04eh
        vmaxps x1, x1, scratchrmm
        vshufps scratchrmm, x1, x1, 0bbh
        vmaxps x1, x1, scratchrmm
    ELSE
        vperm2f128 scratchrmm, x1, x1, 23h
        vmaxpd x1, x1, scratchrmm
        vshufpd scratchrmm, x1, x1, 005h
        vmaxpd x1, x1, scratchrmm
    ENDIF
ENDM
;================================
rhadd MACRO x1
    IFIDN ISType, <X4>
haddps x1, x1
haddps x1, x1
    ELSEIFIDN ISType, <X8>
haddpd x1, x1
    ELSEIFIDN ISType, <Y4>
        vperm2f128 scratchrmm, x1, x1, 23h
        vaddps x1, x1, scratchrmm     ; can also be vhaddps
        vhaddps x1, x1, scratchrmm
        vhaddps x1, x1, scratchrmm
    ELSE
        vperm2f128 scratchrmm, x1, x1, 23h
        vaddpd x1, x1, scratchrmm
        vhaddpd x1, x1, scratchrmm
    ENDIF
ENDM
;================================
rmovs8 MACRO thereal, x1 ;; scalar move low float into real8
    IFIDN ISType, <X4>
movss temp4, x1
fld temp4 ;; must go to this trouble apparently
fstp thereal
    ELSEIFIDN ISType, <X8>
movsd thereal, x1
    ELSEIFIDN ISType, <Y4>
% vmovss temp4, &XMMfromYMM(x1)
fld temp4 ;; must go to this trouble apparently
fstp thereal
    ELSE
% vmovsd thereal, &XMMfromYMM(x1)
    ENDIF
ENDM
;================================
rcvtr2i MACRO x1, x2
    IFIDN ISType, <X4>
        cvtps2dq x1, x2
    ELSEIFIDN ISType, <X8>
cvtpd2dq x1, x2
    ELSEIFIDN ISType, <Y4>
        vcvtps2dq x1, x2
    ELSE
% vcvtpd2dq &XMMfromYMM(x1), x2
    ENDIF
ENDM
;================================
rcvti2r MACRO x1, x2
    IFIDN ISType, <X4>
        cvtdq2ps x1, x2
    ELSEIFIDN ISType, <X8>
cvtdq2pd x1, x2
    ELSEIFIDN ISType, <Y4>
        vcvtdq2ps x1, x2
    ELSE
vcvtdq2pd x1, x2           ;; note using only low 128 bits of x2
    ENDIF
ENDM
;================================
rcvt2negmask MACRO x1              ;; convert xmm1 int's to neg masks
LOCAL thexmm
    IFIDN ISType, <X4>
    pslld x1, 31
    ELSEIFIDN ISType, <X8>
    pslld x1, 31
    pshufd x1, x1, 073h       
    ELSEIFIDN ISType, <Y4>
        vextractf128 scratchxmm, x1, 1
%        thexmm equ &XMMfromYMM(x1)
        vpslld scratchxmm, scratchxmm, 31
        vpslld thexmm, thexmm, 31
        vinsertf128 x1, x1, scratchxmm, 1
    ELSE                   ;; NOTE using xmm2 as scratch also = cheating
        vxorps scratchxmm, scratchxmm, scratchxmm
%        thexmm equ &XMMfromYMM(x1)
        vpslld thexmm, thexmm, 31
        vpunpckhdq xmm2, scratchxmm, thexmm
        vpunpckldq thexmm, scratchxmm, thexmm
        vinsertf128 x1, x1, xmm2, 1
    ENDIF
ENDM
;================================
rpxor MACRO x1, x2              ;; only being used to zero
    IFIDN ISTypeXY, <X>
        pxor x1, x2
    ELSE
        vxorps x1, x1, x2        ;; vpxor is only AVX2
    ENDIF
ENDM
;================================


What do you think, is AVX worth the trouble? If so, is this "orthogonal instruction set" idea worth it?
I am NaN ;)

Gunther

Hi rrr,

Quote from: rrr314159 on April 13, 2015, 02:39:47 PM
What do you think, is AVX worth the trouble? If so, is this "orthogonal instruction set" idea worth it?

my machine at the university doesn't support AVX. So I can test your program this evening at home. I think that AVX is worth the effort. AVX512 is coming soon, too.

Gunther
You have to know the facts before you can distort them.

Siekmanski

#104
Hi, i'm back from a week without programming.
Did improve the speed of the LUT approach.
Although the LUT is a little faster for calculating one value, i must say i like the Chebyshev approximation a lot more.
More precision and easy to use with SIMD instructions.
So, now i'll try to make some more trig functions for my D3Dmath.lib

Hi rrr, here is another one with much higher precision, a 9th degree polynomial and only 5 coeffs ( maximum error of about 3.3381e-9 )

.data
align 16
OneDivPiDP      real8 2 dup (0.31830988618379067153776752674)
PiDP            real8 2 dup (3.14159265358979323846264338327)

ChebyCh0        real8 2 dup (0.99999997658988206732793421604)
ChebyCh1        real8 2 dup (-0.16666647634639712527586027070)
ChebyCh2        real8 2 dup (0.00833289982335175125347370686)
ChebyCh3        real8 2 dup (-0.00019800897762795431268299998)
ChebyCh4        real8 2 dup (0.00000259048850053605227412420)

.code

SSE2cheby9 proc

    mulpd       xmm0,oword ptr OneDivPiDP   ; 1/pi to get a 1pi range
    cvtpd2dq    xmm2,xmm0                   ; (2 packed dpfp to 2 even int32) lose the fractional parts and keep it in xmm2 to save the signs
    cvtdq2pd    xmm1,xmm2                   ; (2 even packed int32 to 2 dpfp) save the integral parts
    subpd       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 to 0.4999999
    pslld       xmm2,31
    pshufd      xmm2,xmm2,73h         
    xorpd       xmm0,xmm2                   ; set sign-bits
    mulpd       xmm0,oword ptr PiDP         ; restore ranges between -1/2 pi to +1/2 pi

; Now do the Chebyshev approximation of a 9th degree polynomial
; With only 5!! optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi

    movapd      xmm2,xmm0
    mulpd       xmm2,xmm2

    movapd      xmm1,oword ptr ChebyCh4
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyCh3
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyCh2
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyCh1
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyCh0
    mulpd       xmm0,xmm1
    ret
SSE2cheby9 endp
Creative coders use backward thinking techniques as a strategy.