Trigonometry ...

Started by rrr314159, March 26, 2015, 04:48:39 PM


rrr314159

These routines replace fsin and fcos using the FPU. They're based on the Taylor series for sin and cos. I had been using sin/cos lookup tables (4096 entries), which are a bit faster, but for more precision the LUT gets very large. The main reason I developed these, though, is that I'm translating all my algos to SSE, and would need the gather instruction (AVX2) to use LUTs. Even if I had it on my machine, most people don't.

I have various requirements for trig routines:

- min precision about 4 decimal digits but capable of (much) higher when a flag is set.
- use no more than 4 FPU registers preferably (could be waived if worth it).
- at least 3 times faster than FPU.
- SIMD-izable, which rules out some techniques.

These two routines, trigC and trigS, are based on the cos and sin Taylor series respectively. trigC is faster but requires one FPU reg per T.S. term (beyond the first "1"). With 4 regs (last term is x^8) it achieves only 5 significant decimal digits. trigS OTOH uses only 4 regs no matter how many terms. This version, at the higher precision setting, uses 6 terms, up to x^11, to achieve almost 9 digits. Both routines calculate both sin and cos, with different accuracy patterns (the trigS pattern is better, but that gets into too much detail - if interested say so).
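As a rough cross-check on those digit counts, here is a minimal Python sketch (not the posted assembly; the names `sin_taylor` and `worst_error` are made up for illustration). It measures the worst-case error of a truncated sine series, assuming the argument has already been reduced to the first quadrant, 0..pi/2.

```python
import math

def sin_taylor(x, terms):
    """Sum the first `terms` terms of the sine series: x - x^3/3! + x^5/5! - ..."""
    term = x          # current term, starts at x^1/1!
    total = x
    for k in range(1, terms):
        # next term = previous term * (-x^2) / ((2k)(2k+1))
        term *= -x * x / ((2 * k) * (2 * k + 1))
        total += term
    return total

def worst_error(terms, samples=4096):
    """Largest |series - math.sin| over evenly spaced points in [0, pi/2]."""
    worst = 0.0
    for i in range(samples + 1):
        x = (math.pi / 2) * i / samples
        worst = max(worst, abs(sin_taylor(x, terms) - math.sin(x)))
    return worst

print(f"4 terms (up to x^7):  {worst_error(4):.2e}")
print(f"6 terms (up to x^11): {worst_error(6):.2e}")
```

The error peaks at pi/2 and is close to the first omitted term, which is why adding two terms buys roughly four more digits.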

On the Intel i5 speed is satisfactory, varying from about 4 times faster than the FPU to more than 9 times: roughly 3.5 nanoseconds per iteration at about 4 digits of precision, enough for some purposes.

However on the AMD they're unsatisfactory. First, they're about 3 times slower than on the i5 - rather worse than usual. But the amazing thing is, AMD FPU sin and cos are almost as fast as Intel's! Since the AMD's clock speed is just a little more than half the i5's, per cycle AMD is much better.
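To put the per-cycle comparison in numbers, a small Python sketch that converts the nanosecond timings from the tables in this post into clock cycles. It assumes each CPU really ran at the quoted clock speed during the test (no turbo or throttling), which the thread doesn't confirm; the `cycles` helper is just for illustration.

```python
# Timings (ns per iteration) and clock speeds (GHz) copied from the tables in this post.
machines = {
    "Intel i5 3330": {"ghz": 2.94, "fsin_ns": 32.129, "trigC_sin_ns": 3.604},
    "AMD A6":        {"ghz": 1.8,  "fsin_ns": 35.279, "trigC_sin_ns": 12.161},
}

def cycles(ns, ghz):
    """Nanoseconds times cycles-per-nanosecond gives clock cycles."""
    return ns * ghz

for name, m in machines.items():
    print(f"{name}: fsin ~{cycles(m['fsin_ns'], m['ghz']):.0f} cycles, "
          f"trigC sin ~{cycles(m['trigC_sin_ns'], m['ghz']):.0f} cycles")
```

Under that assumption fsin costs roughly 94 cycles on the i5 against about 64 on the A6, while the fast trigC costs about twice as many cycles on the A6 as on the i5.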

BTW I've heard it said there's no big difference between Intel and AMD - that's not my experience.

Anyway, I want faster and better routines if possible. These two can be sped up maybe 10% in obvious ways - some of which, no doubt, I'm overlooking - so let me know of any you notice. But also, I'm wondering if there are any tricks that would really make a difference. Some ideas:

- I wonder if fixed point would help. I doubt it.
- By factoring the series you can eliminate one multiply, don't think it's worth it.
- Manipulating the quadrants flag you can eliminate one branch, doesn't seem to help though, the extra memory access kills the advantage.
- Mixing the two Taylor Series in the obvious way - I'm currently investigating that
- I did look on the net, saw various libraries, but it's a lot easier (and almost always better) to just write my own than hassle with them. So please don't just give me a link to a library unless u have reason to think it's worth it.
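On the factoring idea in the list above: one common way to factor a polynomial is Horner's scheme. Whether that is the factoring rrr314159 has in mind is an assumption, and the sketch below is plain Python with the standard Taylor coefficients, not the macro's scaled constants.

```python
import math

# Plain Taylor coefficients for cos in terms of z = x*x: 1 - z/2! + z^2/4! - z^3/6!
A1, A2, A3 = -1.0 / 2, 1.0 / 24, -1.0 / 720

def cos_term_by_term(x):
    """Each power of z is built separately, then scaled and summed."""
    z = x * x
    z2 = z * z
    z3 = z2 * z
    return 1.0 + A1 * z + A2 * z2 + A3 * z3    # 6 multiplies in total

def cos_horner(x):
    """Same polynomial, nested: 1 + z*(A1 + z*(A2 + z*A3))."""
    z = x * x
    return 1.0 + z * (A1 + z * (A2 + z * A3))  # 4 multiplies in total

print(cos_term_by_term(0.5), cos_horner(0.5), math.cos(0.5))
```

The nested form needs fewer multiplies, but every step waits on the previous one, so the saving may well not show up in timings.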

If you have a candidate routine use my test bed to get stats, or give it to me and I'll do it.

Here is the fastest version I have at the moment, called trigC:

.data                             ;; for both trig routines
    piover2 real8 1.5707963267948966
    twooverpi real8 0.63661977236758138
.code
; »»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»
.data                                   ; data used by trigC MACRO
    __cos1 real8 -1.2337005501361697    ; -1/2 (pi/2)^2
    __cos2 real8 0.16666666666666667    ; 2/(3*4)
    __cos3 real8 0.06666666666666667    ; 2/(5*6)
    __cos4 real8 0.03571428571428571    ; 2/(7*8)
    one real8 1.0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
;;*******************************       ;; sin or cos in st(0), uses 4 regs
trigC MACRO sincosflag:=<0>             ;; but needs one extra per term                   
;;*******************************
    fmul twooverpi                      ;; div by pi/2 to put in 4 quadrants
    fld st
    fisttp qword ptr [esp-8]            ;; fisttp truncates toward zero (not -inf): OK for x >= 0 only
    mov eax, dword ptr [esp-8]          ;; (low dword of) int quotient x / (pi/2)
    fild qword ptr [esp-8]

    add eax, sincosflag                 ;; sin if sincosflag = 0, cos if 1                     
    fsub                                ;; now mod 1 (0 to .999999) meaning 0-pi/2

    test eax, 1
    jnz @F
        fld1
        fsubrp                          ;; replace w/ 1-x for these quadrants
    @@:

    fmul st, st
    fmul __cos1
    fld st              ;; c1*x^2, c1*x^2

    fmul st, st(1)      ;; c1^2*x^4, c1*x^2 
    fmul __cos2         ;; c2*c1^2*x^4, c1*x^2
    fld st              ;; c2*c1^2*x^4, c2*c1^2*x^4, c1*x^2

    fmul st, st(2)      ;; c2*c1^3*x^6, c2*c1^2*x^4, c1*x^2
    fmul __cos3         ;; c3*c2*c1^3*x^6, c2*c1^2*x^4, c1*x^2

    IF MORE_PRECISION EQ 1                       ;; do one more term
        fld st          ;; c3*c2*c1^3*x^6, c3*c2*c1^3*x^6, c2*c1^2*x^4, c1*x^2
        fmul st, st(3)  ;; c3*c2*c1^4*x^8, c3*c2*c1^3*x^6, c2*c1^2*x^4, c1*x^2
        fmul __cos4     ;; c4*c3*c2*c1^4*x^8, c3*c2*c1^3*x^6, c2*c1^2*x^4, c1*x^2
        fadd
    ENDIF

    fadd
    fadd
    fadd one                            ;; answer in st(0) all other regs free

    and eax, 2
    je @F
        fchs                            ;; was in a negative quadrant
    @@:
ENDM
;;*******************************
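
For readers who don't want to trace the FPU stack, here is a line-by-line Python rendering of the macro above. It is a sketch for following the arithmetic, not a replacement: the name `trig_c` and its parameters are made up, and it uses `math.floor` for the quadrant (the rounding the [edit] further down says was intended), so it also handles negative angles.

```python
import math

TWO_OVER_PI = 0.63661977236758138   # twooverpi
C1 = -1.2337005501361697            # __cos1 = -1/2 (pi/2)^2
C2 = 1.0 / 6                        # __cos2 = 2/(3*4)
C3 = 1.0 / 15                       # __cos3 = 2/(5*6)
C4 = 1.0 / 28                       # __cos4 = 2/(7*8)

def trig_c(theta, cos_flag=0, more_precision=False):
    """sin(theta) if cos_flag == 0, cos(theta) if cos_flag == 1."""
    x = theta * TWO_OVER_PI          # angle in units of pi/2
    q = math.floor(x)                # quadrant number
    f = x - q                        # position inside the quadrant, 0..1
    n = q + cos_flag
    if n & 1 == 0:
        f = 1.0 - f                  # replace with 1-x for these quadrants
    t1 = C1 * f * f                  # -a^2/2!  where a = f*pi/2
    t2 = C2 * t1 * t1                # +a^4/4!
    t3 = C3 * t2 * t1                # -a^6/6!
    total = t3
    if more_precision:
        total += C4 * t3 * t1        # +a^8/8!
    total = total + t2 + t1 + 1.0
    if n & 2:
        total = -total               # negative quadrant
    return total

print(trig_c(0.5), math.sin(0.5))
print(trig_c(0.5, cos_flag=1, more_precision=True), math.cos(0.5))
```

Each term is built from the previous one times t1, which is why the stored constants are ratios like 2/(3*4) rather than 1/4!.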


and here are runs with timing and precision stats:

Intel i5 3330 2.94 GHz
    ----------------------------------------
FPU fsin nanos per iter         32.129
FPU fcos nanos per iter         31.583

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       4.996
Speed ratio FPU/test fn         6.38

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       4.910
Speed ratio FPU/test fn         6.49

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       8.427
Speed ratio FPU/test fn         3.78

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       8.380
Speed ratio FPU/test fn         3.8

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.604
Speed ratio FPU/test fn         8.84

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.463
Speed ratio FPU/test fn         9.2

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       5.211
Speed ratio FPU/test fn         6.11

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       5.206
Speed ratio FPU/test fn         6.12

Precision
average precision               2.29e-006
worst precision                 2.47e-005

================================================
================================================
AMD A6 1.8 GHz
    ----------------------------------------
FPU fsin nanos per iter         35.279
FPU fcos nanos per iter         37.223

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       17.517
Speed ratio FPU/test fn         2.07

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       17.350
Speed ratio FPU/test fn         2.09

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       27.397
Speed ratio FPU/test fn         1.32

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       27.137
Speed ratio FPU/test fn         1.34

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       12.161
Speed ratio FPU/test fn         2.98

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       12.103
Speed ratio FPU/test fn         3

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       16.754
Speed ratio FPU/test fn         2.16

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       16.673
Speed ratio FPU/test fn         2.17

Precision
average precision               2.29e-006
worst precision                 2.47e-005


The zip includes trig32.asm with trigS and trigC, plus test_support_macros.asm with the test bed. Writing that was a PITA, more code than the trig routines, and much less fun. The zip also has trig32.exe, which produces tables like those shown above.

[edit] apologies to anyone who downloaded this expecting it to be correct - u shld know me better :) qword pointed out the mistake: fisttp shld be fistp, with the FPU rounding mode set to "down towards -inf" (fisttp always truncates toward zero). Has no effect on any runs made so far, but would if u used negative inputs...
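
The difference between the two roundings is easy to see outside the FPU. A minimal Python sketch, with made-up helper names, of the quadrant/fraction split done both ways:

```python
import math

TWO_OVER_PI = 0.63661977236758138   # same constant as twooverpi in the macro

def reduce_trunc(theta):
    """Quadrant and fraction using truncation toward zero (what fisttp does)."""
    x = theta * TWO_OVER_PI
    q = math.trunc(x)
    return q, x - q

def reduce_floor(theta):
    """Quadrant and fraction using rounding toward -inf (the intended behaviour)."""
    x = theta * TWO_OVER_PI
    q = math.floor(x)
    return q, x - q

for theta in (0.5, -0.5):
    print(theta, reduce_trunc(theta), reduce_floor(theta))
```

For non-negative angles the two agree. For negative angles truncation leaves a negative fraction, outside the 0..1 range that the series and the quadrant logic are written for.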

[edit] please go to Reply #40 for fixed / latest versions
I am NaN ;)

sinsi

AMD A10-7850K 3.7GHz

    --------------------------------------
FPU fsin nanos per iter         30.486
FPU fcos nanos per iter         29.228

Test Function: trigS =====================

********** SIN Using faster version ******
Nanoseconds per Iteration       4.541
Speed ratio FPU/test fn         6.57

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version ******
Nanoseconds per Iteration       4.527
Speed ratio FPU/test fn         6.6

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision ****
Nanoseconds per Iteration       7.935
Speed ratio FPU/test fn         3.76

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision ****
Nanoseconds per Iteration       8.240
Speed ratio FPU/test fn         3.62

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC =====================

********** SIN Using faster version ******
Nanoseconds per Iteration       2.835
Speed ratio FPU/test fn         10.5

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version ******
Nanoseconds per Iteration       2.906
Speed ratio FPU/test fn         10.3

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision ****
Nanoseconds per Iteration       4.382
Speed ratio FPU/test fn         6.81

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision ****
Nanoseconds per Iteration       4.677
Speed ratio FPU/test fn         6.38

Precision
average precision               2.29e-006
worst precision                 2.47e-005


rrr314159

Thanks Sinsi, your A10 has similar numbers to my i5. Strange how the AMD A6, while generally much slower, is so fast for fsin and fcos.

MichaelW

Core-i3 3.0 GHz:

    ----------------------------------------
FPU fsin nanos per iter         34.265
FPU fcos nanos per iter         34.241

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       5.325
Speed ratio FPU/test fn         6.43

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       5.256
Speed ratio FPU/test fn         6.52

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       9.030
Speed ratio FPU/test fn         3.79

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       8.969
Speed ratio FPU/test fn         3.82

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.778
Speed ratio FPU/test fn         9.07

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.645
Speed ratio FPU/test fn         9.4

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       5.434
Speed ratio FPU/test fn         6.3

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       5.444
Speed ratio FPU/test fn         6.29

Precision
average precision               2.29e-006
worst precision                 2.47e-005


Well Microsoft, here's another nice mess you've gotten us into.

dedndave

p-4 prescott w/htt @ 3 GHz
FPU fsin nanos per iter         0.048
FPU fcos nanos per iter         0.048

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       0.029
Speed ratio FPU/test fn         1.63

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       0.030
Speed ratio FPU/test fn         1.6

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       0.032
Speed ratio FPU/test fn         1.51

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       0.031
Speed ratio FPU/test fn         1.51

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       0.029
Speed ratio FPU/test fn         1.63

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       0.029
Speed ratio FPU/test fn         1.63

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       0.030
Speed ratio FPU/test fn         1.6

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       0.030
Speed ratio FPU/test fn         1.6

Precision
average precision               2.29e-006
worst precision                 2.47e-005

TWell

AMD E1-6010 1.35 GHz
    ----------------------------------------
FPU fsin nanos per iter 60.574
FPU fcos nanos per iter 63.197

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration 29.655
Speed ratio FPU/test fn 2.09

Precision
average precision 1.59e-005
worst precision    1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration 30.067
Speed ratio FPU/test fn 2.06

Precision
average precision 1.59e-005
worst precision    1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration 51.286
Speed ratio FPU/test fn 1.21

Precision
average precision 4.12e-009
worst precision    5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration 48.504
Speed ratio FPU/test fn 1.28

Precision
average precision 4.12e-009
worst precision    5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration 20.868
Speed ratio FPU/test fn 2.97

Precision
average precision 1.01e-004
worst precision    8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration 20.494
Speed ratio FPU/test fn 3.02

Precision
average precision 1.01e-004
worst precision    8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration 28.468
Speed ratio FPU/test fn 2.17

Precision
average precision 2.29e-006
worst precision    2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration 29.783
Speed ratio FPU/test fn 2.08

Precision
average precision 2.29e-006
worst precision    2.47e-005

MichaelW

Core-i3 3.0 GHz:
FPU fsin nanos per iter         34.265


p-4 prescott w/htt @ 3 GHz
FPU fsin nanos per iter    0.048

A P4 with the same clock speed and a lower IPC is ~713 times faster?

jj2007

Core i5:
FPU fsin nanos per iter         33.359
FPU fcos nanos per iter         32.851

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       4.705
Speed ratio FPU/test fn         7.04

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       4.586
Speed ratio FPU/test fn         7.22

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       8.167
Speed ratio FPU/test fn         4.05

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       8.208
Speed ratio FPU/test fn         4.03

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.212
Speed ratio FPU/test fn         10.3

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.200
Speed ratio FPU/test fn         10.3

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       4.908
Speed ratio FPU/test fn         6.75

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       4.911
Speed ratio FPU/test fn         6.74

Precision
average precision               2.29e-006
worst precision                 2.47e-005

hutch--

On my i7.


FPU fsin nanos per iter         30.473
FPU fcos nanos per iter         30.446

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       8.907
Speed ratio FPU/test fn         3.42

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       9.117
Speed ratio FPU/test fn         3.34

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       12.853
Speed ratio FPU/test fn         2.37

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       12.999
Speed ratio FPU/test fn         2.34

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       6.278
Speed ratio FPU/test fn         4.85

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       5.839
Speed ratio FPU/test fn         5.22

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       8.680
Speed ratio FPU/test fn         3.51

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       8.861
Speed ratio FPU/test fn         3.44

Precision
average precision               2.29e-006
worst precision                 2.47e-005

nidud

#9
deleted

Siekmanski

#10
Windows 8.1  Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz


    ----------------------------------------
FPU fsin nanos per iter         6.956
FPU fcos nanos per iter         6.860

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.051
Speed ratio FPU/test fn         6.57

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       1.031
Speed ratio FPU/test fn         6.7

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       1.835
Speed ratio FPU/test fn         3.76

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       1.787
Speed ratio FPU/test fn         3.86

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       0.749
Speed ratio FPU/test fn         9.23

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       0.718
Speed ratio FPU/test fn         9.62

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       1.096
Speed ratio FPU/test fn         6.3

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       1.096
Speed ratio FPU/test fn         6.3

Precision
average precision               2.29e-006
worst precision                 2.47e-005


Creative coders use backward thinking techniques as a strategy.

dedndave

anyone notice that my P4 and nidud's Athlon are kicking ass ?

maybe something not right with the test program ?

rrr314159

Thanks Marinus,

But, (AFAIK) u need vgather instruction to use LUT with SSE, which is no good for me because, even if I had it, very few others do. Maybe your routine has some way to deal with its absence, I'll see (this evening, busy with "life" stuff now). Your i7 is scary fast, I gotta have one of those!

BTW LUT is used only for first quadrant as a matter of course, the other quadrants are symmetric - if u know anyone using a LUT all around the circle, ... well, don't let them perform any work where numerical skills are vital, let's put it that way.

My LUT currently is only 4096, I did make a version with 16384, even 65536 just for the heck of it, but that was overkill. One problem is that, although precision vs. speed is rather better than Taylor Series, it's not smooth, but stair-steps between the LUT points. For graphics (my main use these days) a less precise, but smooth, curve is much better than a more precise jagged one. (Of course u can interpolate between LUT points to give an n-gon shape but that totally kills speed advantage). Considering that, the LUT and T.S. approaches are fairly equal. The T.S. is easier to deal with. However nice thing about LUT, easily adaptable to other periodic curves such as sin^2, can even make custom curves (user makes with the mouse) that can look cool or weird. Bottom line - I'd stick with LUT if I, and "almost all" users, had vgather.
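
To make the stair-step versus interpolation trade-off concrete, a rough Python sketch of a first-quadrant sine table. The layout is an assumption (4096 steps plus one extra end entry so interpolation can read the next slot); the actual table isn't shown in this thread, and the function names are made up.

```python
import math

N = 4096                      # table steps across the first quadrant
STEP = (math.pi / 2) / N
# N + 1 entries so the interpolating lookup can read TABLE[i + 1] at the top end
TABLE = [math.sin(i * STEP) for i in range(N + 1)]

def sin_lut_step(theta):
    """Stair-step lookup: take the table entry at or below theta."""
    return TABLE[int(theta / STEP)]

def sin_lut_lerp(theta):
    """Linear interpolation between the two neighbouring entries."""
    pos = theta / STEP
    i = min(int(pos), N - 1)
    frac = pos - i
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])

def worst_error(fn, samples=100_000):
    """Largest |fn - math.sin| over evenly spaced points in [0, pi/2)."""
    return max(abs(fn(t) - math.sin(t))
               for t in ((math.pi / 2) * k / samples for k in range(samples)))

print(f"stair-step:   {worst_error(sin_lut_step):.2e}")
print(f"interpolated: {worst_error(sin_lut_lerp):.2e}")
```

The stair-step error is about one table step (a few parts in 10^4), while interpolation gets down to the 1e-8 range at the cost of an extra load, a subtract and a multiply per lookup.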

@dedndave and MichaelW, yes something is definitely wrong w/ test program on those old machines, Prescott and Athlon II! I guess - they don't have QueryPerformanceFrequency instruction? (Or, mfence ... but why didn't they just blow up...?) So they're trying to divide by zero, or random garbage. Sorry. Still don't feel you've wasted your time (dedndave and nidud) those results are more valuable than others, to remind me there are still some computers out there that need alternate routines. Perhaps, for them, I can keep count by making notches in a clay tablet ... At least the actual calc's are being done right, as u can tell from the precision numbers, which should be (and are) identical.

All the other times look right, thanks to all.

[edit] Siekmanski, put life on hold for a moment and looked at your routine, didn't realize it was so simple. Looks like better way to handle LUT than way I've been doing it (which has about twice as many instructions), but - it only looks up one value at a time, right? Not using SIMD - that's why there's no vgather. Presumably T.S. (which can easily calc 4 or 8, whatever, values at once, with SSE) will turn out to be better - will let u know.

Antariy

rrr, what is "T.S"?

The timings on Celeron D310

    ----------------------------------------
FPU fsin nanos per iter         78.477
FPU fcos nanos per iter         78.188

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       26.469
Speed ratio FPU/test fn         2.96

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       26.556
Speed ratio FPU/test fn         2.95

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       43.149
Speed ratio FPU/test fn         1.82

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       43.107
Speed ratio FPU/test fn         1.82

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       22.383
Speed ratio FPU/test fn         3.5

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       22.422
Speed ratio FPU/test fn         3.49

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       32.139
Speed ratio FPU/test fn         2.44

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       32.084
Speed ratio FPU/test fn         2.44

Precision
average precision               2.29e-006
worst precision                 2.47e-005

dedndave

T.S. = Taylor Series   :P

this machine has QueryPerformanceCounter
something not quite right, there   :redface:
maybe you do a MUL and disregard the high dword ? - something along those lines
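
dedndave's guess can be illustrated in a few lines of Python. This is only a sketch of the hypothesis: the test bed's timing code isn't shown in this thread, the function names are made up, and the counter frequency and tick count below are hypothetical numbers.

```python
MASK32 = 0xFFFFFFFF

def elapsed_ns_full(ticks, freq):
    """Full-width arithmetic: ticks * 1e9 / freq, nothing discarded."""
    return ticks * 1_000_000_000 // freq

def elapsed_ns_low_dword_only(ticks, freq):
    """Same formula, but keeping only the low 32 bits of the product,
    as would happen if the high dword (EDX after a 32-bit MUL) were ignored."""
    return ((ticks * 1_000_000_000) & MASK32) // freq

# Hypothetical numbers: a 3 GHz counter and a 0.1 second measurement.
freq = 3_000_000_000
ticks = 300_000_000

print(elapsed_ns_full(ticks, freq))            # full-width result
print(elapsed_ns_low_dword_only(ticks, freq))  # result after losing the high dword
```

With a counter running in the GHz range the product overflows 32 bits almost immediately, so a dropped high dword would give times that are far too small, which is at least the direction of the odd P4 numbers.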