Hi siekmanski,

Glad you asked, I was just about to post this update ...

Here are my current versions of the trig routines. There are two of them: Taylor series (T.S.) and lookup table (LUT).

T.S. - I used a technique from Jean-Michel Muller, taken from the SSE routine qWord donated, to improve trigS (the sin T.S. routine). It gets up to 11 significant digits; or, in reduced precision mode, it gets about 5 digits (average error around 1.6e-5 in the run below) 9 times faster than the FPU.
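
For anyone who hasn't looked at a Taylor-series sine before, here is a minimal C sketch of the general idea. To be clear, this is NOT trigS and does not use the Muller technique; it's just the plain textbook series in nested (Horner) form, and the name `sin_taylor` and the `terms` parameter are made up for illustration. It assumes the input has already been reduced to roughly [-pi/2, pi/2].

```c
#include <math.h>

/* Plain Taylor series for sin(x), using `terms` terms:
 *   x - x^3/3! + x^5/5! - ... up to the x^(2*terms-1) term.
 * Nested form: x * (1 - x^2/(2*3) * (1 - x^2/(4*5) * (1 - ...))).
 * Hypothetical helper for illustration only, not the trigS routine. */
static double sin_taylor(double x, int terms)
{
    double x2 = x * x;
    double r = 1.0;
    int k;

    /* Work from the innermost bracket outwards. */
    for (k = terms - 1; k >= 1; k--)
        r = 1.0 - x2 / ((2.0 * k) * (2.0 * k + 1.0)) * r;

    return x * r;
}
```

Calling `sin_taylor(x, 4)` stops at the x^7 term, so at pi/2 the error is bounded by the first dropped term, x^9/9!, roughly 1.6e-4. Calling `sin_taylor(x, 7)` stops at x^13 and is good to better than 1e-9 over the same range. That is the usual speed vs. precision trade: fewer terms, fewer multiplies.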

BTW I've made a SIMD version of that Muller algo (much modified) and sped it up somewhat, but it's still awfully slow. I'll have something to show one of these days; for now these two routines are clearly superior, in spite of handling only one input at a time.

The previously posted trigC, based on the cos T.S., is still pretty good as is, so it's not re-posted here. This one is probably a bit better.

LUT - siekmanski, your latest version seems as precise as can be done, but I realized why I don't have these problems: I use the FPU. Even with real4 data the FPU does intermediate calculations to 80 bits, so it never loses precision; all the way up to 10^38 it gives the same results. So I'm posting my FPU LUT technique, called trigLUT. It uses your 16384-entry table, thanks. I suppose it's slower than your lookup routine, but for me the extra precision is important. It goes 25 times faster than the FPU's own sin/cos, not far from 1 nanosecond per number.

One reason for this thread was to find out "what I'm doing wrong". I always use the "round down, towards -infinity" rounding mode, which matches the mathematical floor (int) function, but everybody else uses the default "round-to-nearest". Although that seems much clumsier to me, I guess it's just as good and has the advantage of being standard. So I should probably convert to that. However I haven't done so yet, so these routines still round towards -inf. A lot easier to understand, isn't it?

Also I should use the standard timers.asm, but again it's not very comfortable to me, so I've still got my own timing routines here. Sorry ... I believe the numbers are reasonably accurate, but their main purpose is just relative comparison.

Here is the LUT routine:

    ;;*******************************
    trigLUT MACRO sincosflag:=<0>       ;; rrr314159 2015/03/30 fpu-based lookup
    ;;*******************************
    ;; expects x in st(0) and the FPU rounding control set to round down (towards -inf)
        fmul twooverpi                  ;; x * 2/pi: one quadrant now spans 1.0
        fld st
        fistp qword ptr [esp-8]         ;; integer quotient, math-like, rounded down towards -inf
        mov eax, dword ptr [esp-8]      ;; eax = low dword of the integer quotient x / (pi/2)
        fild qword ptr [esp-8]
        fsub                            ;; now mod 1 (0 to .999999) meaning 0-pi/2
        add eax, sincosflag             ;; this converts to cos when sincosflag = 1
        test eax, 1
        jz @F
        fld1
        fsubrp                          ;; replace w/ 1-fraction for these (odd) quadrants
    @@:
        fmul FLT8(16384.0)
        fistp dword ptr [esp-4]         ;; get table index
        mov edx, dword ptr [esp-4]
        fld real4 ptr [sseSinTable+edx*4]
        and eax, 2
        jz @F
        fchs                            ;; was in a negative quadrant
    @@:
    ENDM
    ;;*******************************

trigS is longer and messier, so please check the zip for that. Here is a typical precision / timing run:

    FPU average nanos per iter 32.600

    =======================================================================
    Test Function: trigS ========================================
    =======================================================================
    ********** SIN Using faster version **********
    Nanoseconds per Iteration 3.589
    Speed ratio FPU/test fn 9.08
    Precision
    average precision 1.58e-005
    worst precision 1.53e-004
    ********** COS Using faster version **********
    Nanoseconds per Iteration 3.584
    Speed ratio FPU/test fn 9.1
    Precision
    average precision 1.57e-005
    worst precision 1.48e-004
    ********** SIN Using higher precision **********
    Nanoseconds per Iteration 7.389
    Speed ratio FPU/test fn 4.41
    Precision
    average precision 2.93e-011
    worst precision 4.59e-011
    ********** COS Using higher precision **********
    Nanoseconds per Iteration 7.407
    Speed ratio FPU/test fn 4.4
    Precision
    average precision 2.93e-011
    worst precision 4.58e-011

    =======================================================================
    Test Function: trigLUT ========================================
    =======================================================================
    ********** SIN Using higher precision **********
    Nanoseconds per Iteration 1.295
    Speed ratio FPU/test fn 25.2
    Precision
    average precision 1.52e-005
    worst precision 4.41e-005
    ********** COS Using higher precision **********
    Nanoseconds per Iteration 1.306
    Speed ratio FPU/test fn 25
    Precision
    average precision 1.53e-005
    worst precision 4.77e-005

The zip contains

trig32.asm ; the main routines (called 32 because there's also a 64-bit version, BTW)

test_support_macros ; lots of support macros, could stand some cleaning up

trig32.exe

doj32.bat ; "do JWasm 32-bit" compiling batch file, just for reference

[edit] Whoops, I forgot to comment out the line "MY_CPU=1" at the top of trig32.asm. Just comment that out, unless you know your own CPU's FPU sin/cos average speed, in which case put it below at "MY_CPU_NANOS". This saves having to recompute it every time you run the prog.