The MASM Forum

General => The Laboratory => Topic started by: rrr314159 on March 26, 2015, 04:48:39 PM

Title: Trigonometry ...
Post by: rrr314159 on March 26, 2015, 04:48:39 PM
These routines replace fsin and fcos using the FPU. They're based on Taylor Series for sin and cos. I had been using sin/cos lookup tables (4096 entries) which are a bit faster, but for more precision the LUT gets very large. But the main reason I developed these is because I'm translating all my algos to SSE, and would need the gather instruction (AVX2) to use LUTs. Even if I had them on my machine, most people don't.

I have various requirements for trig routines:

- min precision about 4 decimal digits but capable of (much) higher when a flag is set.
- use no more than 4 FPU registers preferably (could be waived if worth it).
- at least 3 times faster than FPU.
- SIMD - izable, which rules out some techniques.

These two routines, trigC and trigS, are based on Cos and Sin taylor series respectively. trigC is faster but requires one FPU reg per T.S. term (beyond the first "1"). With 4 regs (last term is x^8) it achieves only 5 significant decimal digits. trigS OTOH uses only 4 regs no matter how many terms. This version, at higher precision setting, uses 6 terms, up to x^11, to achieve almost 9 digits. Both routines calculate both sin and cos, with different accuracy patterns (trigS pattern is better, but that gets into too much detail - if interested say so).

On the Intel i5 speed is satisfactory, varying from 4 times faster than FPU to more than 9: only 3 nanoseconds per iteration, with 4 digits precision, enough for some purposes.

However on the AMD they're unsatisfactory. First, they're 3 times slower - rather worse than usual. But the amazing thing is, AMD FPU sin and cos are almost as fast as Intel! Since clock speed is just a little more than half, per cycle AMD is much better.

BTW I've heard it said there's no big difference between Intel and AMD - that's not my experience.

Anyway, I want faster and better routines if possible. These two can be sped up maybe 10% in obvious ways - some of which, no doubt, I'm overlooking - so let me know of any you notice. But also, I'm wondering if there are any tricks that would really make a difference. Some ideas,

- I wonder if fixed point would help. I doubt it.
- By factoring the series you can eliminate one multiply, don't think it's worth it.
- Manipulating the quadrants flag you can eliminate one branch, doesn't seem to help though, the extra memory access kills the advantage.
- Mixing the two Taylor Series in the obvious way - I'm currently investigating that
- I did look on the net, saw various libraries, but it's a lot easier (and almost always better) to just write my own than hassle with them. So please don't just give me a link to a library unless u have reason to think it's worth it.

If you have a candidate routine use my test bed to get stats, or give it to me and I'll do it.

Here is the fastest version I have at the moment, called trigC:

Code: [Select]
.data                             ;; for both trig routines
    piover2 real8 1.5707963267948966
    twooverpi real8 0.63661977236758138
.code
; »»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»
.data                                   ; data used by trigC MACRO
    __cos1 real8 -1.2337005501361697    ; 1/2 (pi/2)^2
    __cos2 real8 0.16666666666666667    ; 2/(3*4)
    __cos3 real8 0.06666666666666667    ; 2/(5*6)
    __cos4 real8 0.03571428571428571    ; 2/(7*8)
    one real8 1.0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
;;*******************************       ;; sin or cos in st(0), uses 4 regs
trigC MACRO sincosflag:=<0>             ;; but needs one extra per term                   
;;*******************************
    fmul twooverpi                      ;; div by pi/2 to put in 4 quadrants
    fld st
    fisttp qword ptr [esp-8]            ;; math "int" truncate, down towards -inf
    mov eax, dword ptr [esp-8]          ;; (lower word of) int quotient x / pi/2
    fild qword ptr [esp-8]

    add eax, sincosflag                 ;; sin if sincosflag = 0, cos if 1                     
    fsub                                ;; now mod 1 (0 to .999999) meaning 0-pi/2

    test eax, 1
    jnz @F
        fld1
        fsubrp                          ;; replace w/ 1-x for these quadrants
    @@:

    fmul st, st
    fmul __cos1
    fld st              ;; c1*x^2, c1*x^2

    fmul st, st(1)      ;; c1^2*x^4, c1*x^2 
    fmul __cos2         ;; c2*c1^2*x^4, c1*x^2
    fld st              ;; c2*c1^2*x^4, c2*c1^2*x^4, c1*x^2

    fmul st, st(2)      ;; c2*c1^3*x^6, c2*c1^2*x^4, c1*x^2
    fmul __cos3         ;; c3*c2*c1^3*x^6, c2*c1^2*x^4, c1*x^2

    IF MORE_PRECISION EQ 1                       ;; do one more term
        fld st          ;; c3*c2*c1^3*x^6, c3*c2*c1^3*x^6, c2*c1^2*x^4, c1*x^2
        fmul st, st(3)  ;; c3*c2*c1^4*x^8, c3*c2*c1^3*x^6, c2*c1^2*x^4, c1*x^2
        fmul __cos4     ;; c4*c3*c2*c1^4*x^8, c3*c2*c1^3*x^6, c2*c1^2*x^4, c1*x^2
        fadd
    ENDIF

    fadd
    fadd
    fadd one                            ;; answer in st(0) all other regs free

    and eax, 2
    je @F
        fchs                            ;; was in a negative quadrant
    @@:
ENDM
;;*******************************

and here are runs with timing and precision stats:

Code: [Select]
Intel i5 3330 2.94 Ghz
    ----------------------------------------
FPU fsin nanos per iter         32.129
FPU fcos nanos per iter         31.583

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       4.996
Speed ratio FPU/test fn         6.38

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       4.910
Speed ratio FPU/test fn         6.49

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       8.427
Speed ratio FPU/test fn         3.78

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       8.380
Speed ratio FPU/test fn         3.8

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.604
Speed ratio FPU/test fn         8.84

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.463
Speed ratio FPU/test fn         9.2

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       5.211
Speed ratio FPU/test fn         6.11

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       5.206
Speed ratio FPU/test fn         6.12

Precision
average precision               2.29e-006
worst precision                 2.47e-005

================================================
================================================
AMD A6 1.8 Ghz
    ----------------------------------------
FPU fsin nanos per iter         35.279
FPU fcos nanos per iter         37.223

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       17.517
Speed ratio FPU/test fn         2.07

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       17.350
Speed ratio FPU/test fn         2.09

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       27.397
Speed ratio FPU/test fn         1.32

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       27.137
Speed ratio FPU/test fn         1.34

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       12.161
Speed ratio FPU/test fn         2.98

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       12.103
Speed ratio FPU/test fn         3

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       16.754
Speed ratio FPU/test fn         2.16

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       16.673
Speed ratio FPU/test fn         2.17

Precision
average precision               2.29e-006
worst precision                 2.47e-005

The zip includes trig32.asm with trigS and trigC, plus test_support_macros.asm with the test bed. Writing that was a PITA, more code than the trig routines, and much less fun. The zip also has trig32.exe which produces tables like that shown above.

[edit] apologies to anyone who downloaded this expecting it to be correct - u shld know me better :) qword pointed out mistake. fisttp shld be fistp, with "down towards -inf" rounding. Has no effect on any runs made so far, but would if u used negative inputs...

[edit] please go to Reply #40 (http://masm32.com/board/index.php?topic=4118.msg43732#msg43732) for fixed / latest versions
Title: Re: Trigonometry ...
Post by: sinsi on March 26, 2015, 05:32:48 PM
AMD A10-7850K 3.7GHz
Code: [Select]
    --------------------------------------
FPU fsin nanos per iter         30.486
FPU fcos nanos per iter         29.228

Test Function: trigS =====================

********** SIN Using faster version ******
Nanoseconds per Iteration       4.541
Speed ratio FPU/test fn         6.57

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version ******
Nanoseconds per Iteration       4.527
Speed ratio FPU/test fn         6.6

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision ****
Nanoseconds per Iteration       7.935
Speed ratio FPU/test fn         3.76

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision ****
Nanoseconds per Iteration       8.240
Speed ratio FPU/test fn         3.62

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC =====================

********** SIN Using faster version ******
Nanoseconds per Iteration       2.835
Speed ratio FPU/test fn         10.5

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version ******
Nanoseconds per Iteration       2.906
Speed ratio FPU/test fn         10.3

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision ****
Nanoseconds per Iteration       4.382
Speed ratio FPU/test fn         6.81

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision ****
Nanoseconds per Iteration       4.677
Speed ratio FPU/test fn         6.38

Precision
average precision               2.29e-006
worst precision                 2.47e-005
Title: Re: Trigonometry ...
Post by: rrr314159 on March 26, 2015, 05:46:39 PM
Thanks Sinsi, your A10 has similar numbers to my i5. Strange how AMD A6, while generally much slower, is so fast for fsin and fcos
Title: Re: Trigonometry ...
Post by: MichaelW on March 26, 2015, 05:47:21 PM
Core-i3 3.0 GHz:
Code: [Select]
    ----------------------------------------
FPU fsin nanos per iter         34.265
FPU fcos nanos per iter         34.241

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       5.325
Speed ratio FPU/test fn         6.43

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       5.256
Speed ratio FPU/test fn         6.52

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       9.030
Speed ratio FPU/test fn         3.79

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       8.969
Speed ratio FPU/test fn         3.82

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.778
Speed ratio FPU/test fn         9.07

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.645
Speed ratio FPU/test fn         9.4

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       5.434
Speed ratio FPU/test fn         6.3

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       5.444
Speed ratio FPU/test fn         6.29

Precision
average precision               2.29e-006
worst precision                 2.47e-005

E:\Downloads\NaN\trig_functions\trig functions>pause
Press any key to continue . . .


Title: Re: Trigonometry ...
Post by: dedndave on March 26, 2015, 05:47:55 PM
p-4 prescott w/htt @ 3 GHz
Code: [Select]
FPU fsin nanos per iter 0.048

FPU fcos nanos per iter 0.048



Test Function: trigS ========================================



********** SIN Using faster version **********

Nanoseconds per Iteration 0.029

Speed ratio FPU/test fn 1.63



Precision

average precision 1.59e-005

worst precision    1.57e-004



********** COS Using faster version **********

Nanoseconds per Iteration 0.030

Speed ratio FPU/test fn 1.6



Precision

average precision 1.59e-005

worst precision    1.57e-004



********** SIN Using higher precision **********

Nanoseconds per Iteration 0.032

Speed ratio FPU/test fn 1.51



Precision

average precision 4.12e-009

worst precision    5.63e-008



********** COS Using higher precision **********

Nanoseconds per Iteration 0.031

Speed ratio FPU/test fn 1.51



Precision

average precision 4.12e-009

worst precision    5.63e-008



Test Function: trigC ========================================



********** SIN Using faster version **********

Nanoseconds per Iteration 0.029

Speed ratio FPU/test fn 1.63



Precision

average precision 1.01e-004

worst precision    8.95e-004



********** COS Using faster version **********

Nanoseconds per Iteration 0.029

Speed ratio FPU/test fn 1.63



Precision

average precision 1.01e-004

worst precision    8.95e-004



********** SIN Using higher precision **********

Nanoseconds per Iteration 0.030

Speed ratio FPU/test fn 1.6



Precision

average precision 2.29e-006

worst precision    2.47e-005



********** COS Using higher precision **********

Nanoseconds per Iteration 0.030

Speed ratio FPU/test fn 1.6



Precision

average precision 2.29e-006

worst precision    2.47e-005
Title: Re: Trigonometry ...
Post by: TWell on March 26, 2015, 06:11:29 PM
AMD E1-6010 1.35 GHz
Code: [Select]
    ----------------------------------------
FPU fsin nanos per iter 60.574
FPU fcos nanos per iter 63.197

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration 29.655
Speed ratio FPU/test fn 2.09

Precision
average precision 1.59e-005
worst precision    1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration 30.067
Speed ratio FPU/test fn 2.06

Precision
average precision 1.59e-005
worst precision    1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration 51.286
Speed ratio FPU/test fn 1.21

Precision
average precision 4.12e-009
worst precision    5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration 48.504
Speed ratio FPU/test fn 1.28

Precision
average precision 4.12e-009
worst precision    5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration 20.868
Speed ratio FPU/test fn 2.97

Precision
average precision 1.01e-004
worst precision    8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration 20.494
Speed ratio FPU/test fn 3.02

Precision
average precision 1.01e-004
worst precision    8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration 28.468
Speed ratio FPU/test fn 2.17

Precision
average precision 2.29e-006
worst precision    2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration 29.783
Speed ratio FPU/test fn 2.08

Precision
average precision 2.29e-006
worst precision    2.47e-005
Title: Re: Trigonometry ...
Post by: MichaelW on March 26, 2015, 07:01:26 PM
Core-i3 3.0 GHz:
FPU fsin nanos per iter         34.265


p-4 prescott w/htt @ 3 GHz
FPU fsin nanos per iter    0.048

A P4 with the same clock speed and a lower IPC is ~713 times faster?
Title: Re: Trigonometry ...
Post by: jj2007 on March 26, 2015, 07:25:19 PM
Core i5:
Code: [Select]
FPU fsin nanos per iter         33.359
FPU fcos nanos per iter         32.851

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       4.705
Speed ratio FPU/test fn         7.04

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       4.586
Speed ratio FPU/test fn         7.22

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       8.167
Speed ratio FPU/test fn         4.05

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       8.208
Speed ratio FPU/test fn         4.03

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.212
Speed ratio FPU/test fn         10.3

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.200
Speed ratio FPU/test fn         10.3

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       4.908
Speed ratio FPU/test fn         6.75

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       4.911
Speed ratio FPU/test fn         6.74

Precision
average precision               2.29e-006
worst precision                 2.47e-005
Title: Re: Trigonometry ...
Post by: hutch-- on March 26, 2015, 07:29:03 PM
On my i7.

Code: [Select]
FPU fsin nanos per iter         30.473
FPU fcos nanos per iter         30.446

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       8.907
Speed ratio FPU/test fn         3.42

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       9.117
Speed ratio FPU/test fn         3.34

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       12.853
Speed ratio FPU/test fn         2.37

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       12.999
Speed ratio FPU/test fn         2.34

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       6.278
Speed ratio FPU/test fn         4.85

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       5.839
Speed ratio FPU/test fn         5.22

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       8.680
Speed ratio FPU/test fn         3.51

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       8.861
Speed ratio FPU/test fn         3.44

Precision
average precision               2.29e-006
worst precision                 2.47e-005
Title: Re: Trigonometry ...
Post by: nidud on March 26, 2015, 11:31:08 PM
AMD Athlon(tm) II X2 245 Processor (SSE3)
Code: [Select]
FPU fsin nanos per iter 0.024

FPU fcos nanos per iter 0.025



Test Function: trigS ========================================



********** SIN Using faster version **********

Nanoseconds per Iteration 0.022

Speed ratio FPU/test fn 1.14



Precision

average precision 1.59e-005

worst precision    1.57e-004



********** COS Using faster version **********

Nanoseconds per Iteration 0.021

Speed ratio FPU/test fn 1.15



Precision

average precision 1.59e-005

worst precision    1.57e-004



********** SIN Using higher precision **********

Nanoseconds per Iteration 0.027

Speed ratio FPU/test fn 0.903



Precision

average precision 4.12e-009

worst precision    5.63e-008



********** COS Using higher precision **********

Nanoseconds per Iteration 0.027

Speed ratio FPU/test fn 0.898



Precision

average precision 4.12e-009

worst precision    5.63e-008



Test Function: trigC ========================================



********** SIN Using faster version **********

Nanoseconds per Iteration 0.021

Speed ratio FPU/test fn 1.16



Precision

average precision 1.01e-004

worst precision    8.95e-004



********** COS Using faster version **********

Nanoseconds per Iteration 0.021

Speed ratio FPU/test fn 1.16



Precision

average precision 1.01e-004

worst precision    8.95e-004



********** SIN Using higher precision **********

Nanoseconds per Iteration 0.026

Speed ratio FPU/test fn 0.96



Precision

average precision 2.29e-006

worst precision    2.47e-005



********** COS Using higher precision **********

Nanoseconds per Iteration 0.025

Speed ratio FPU/test fn 0.962



Precision

average precision 2.29e-006

worst precision    2.47e-005
Title: Re: Trigonometry ...
Post by: Siekmanski on March 27, 2015, 12:12:01 AM
Windows 8.1  Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

Code: [Select]
    ----------------------------------------
FPU fsin nanos per iter         6.956
FPU fcos nanos per iter         6.860

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.051
Speed ratio FPU/test fn         6.57

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       1.031
Speed ratio FPU/test fn         6.7

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       1.835
Speed ratio FPU/test fn         3.76

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       1.787
Speed ratio FPU/test fn         3.86

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       0.749
Speed ratio FPU/test fn         9.23

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       0.718
Speed ratio FPU/test fn         9.62

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       1.096
Speed ratio FPU/test fn         6.3

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       1.096
Speed ratio FPU/test fn         6.3

Precision
average precision               2.29e-006
worst precision                 2.47e-005

Title: Re: Trigonometry ...
Post by: dedndave on March 27, 2015, 01:27:46 AM
anyone notice that my P4 and nidud's Athlon are kicking ass ?

maybe something not right with the test program ?
Title: Re: Trigonometry ...
Post by: rrr314159 on March 27, 2015, 02:50:48 AM
Thanks Marinus,

But, (AFAIK) u need vgather instruction to use LUT with SSE, which is no good for me because, even if I had it, very few others do. Maybe your routine has some way to deal with its absence, I'll see (this evening, busy with "life" stuff now). Your i7 is scary fast, I gotta have one of those!

BTW LUT is used only for first quadrant as a matter of course, the other quadrants are symmetric - if u know anyone using a LUT all around the circle, ... well, don't let them perform any work where numerical skills are vital, let's put it that way.

My LUT currently is only 4096, I did make a version with 16384, even 65536 just for the heck of it, but that was overkill. One problem is that, although precision vs. speed is rather better than Taylor Series, it's not smooth, but stair-steps between the LUT points. For graphics (my main use these days) a less precise, but smooth, curve is much better than a more precise jagged one. (Of course u can interpolate between LUT points to give an n-gon shape but that totally kills speed advantage). Considering that, the LUT and T.S. approaches are fairly equal. The T.S. is easier to deal with. However nice thing about LUT, easily adaptable to other periodic curves such as sin^2, can even make custom curves (user makes with the mouse) that can look cool or weird. Bottom line - I'd stick with LUT if I, and "almost all" users, had vgather.

@dedndave and MichaelW, yes something is definitely wrong w/ test program on those old machines, Prescott and Athlon II! I guess - they don't have QueryPerformanceFrequency instruction? (Or, mfence ... but why didn't they just blow up...?) So they're trying to divide by zero, or random garbage. Sorry. Still don't feel you've wasted your time (dedndave and nidud) those results are more valuable than others, to remind me there are still some computers out there that need alternate routines. Perhaps, for them, I can keep count by making notches in a clay tablet ... At least the actual calc's are being done right, as u can tell from the precision numbers, which should be (and are) identical.

All the other times look right, thanks to all.

[edit] Siekmanski, put life on hold for a moment and looked at your routine, didn't realize it was so simple. Looks like better way to handle LUT than way I've been doing it (which has about twice as many instructions), but - it only looks up one value at a time, right? Not using SIMD - that's why there's no vgather. Presumably T.S. (which can easily calc 4 or 8, whatever, values at once, with SSE) will turn out to be better - will let u know.
Title: Re: Trigonometry ...
Post by: Antariy on March 27, 2015, 05:24:51 AM
rrr, what is "T.S"?

The timings on Celeron D310
Code: [Select]
    ----------------------------------------
FPU fsin nanos per iter         78.477
FPU fcos nanos per iter         78.188

Test Function: trigS ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       26.469
Speed ratio FPU/test fn         2.96

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** COS Using faster version **********
Nanoseconds per Iteration       26.556
Speed ratio FPU/test fn         2.95

Precision
average precision               1.59e-005
worst precision                 1.57e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       43.149
Speed ratio FPU/test fn         1.82

Precision
average precision               4.12e-009
worst precision                 5.63e-008

********** COS Using higher precision **********
Nanoseconds per Iteration       43.107
Speed ratio FPU/test fn         1.82

Precision
average precision               4.12e-009
worst precision                 5.63e-008

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       22.383
Speed ratio FPU/test fn         3.5

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** COS Using faster version **********
Nanoseconds per Iteration       22.422
Speed ratio FPU/test fn         3.49

Precision
average precision               1.01e-004
worst precision                 8.95e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       32.139
Speed ratio FPU/test fn         2.44

Precision
average precision               2.29e-006
worst precision                 2.47e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       32.084
Speed ratio FPU/test fn         2.44

Precision
average precision               2.29e-006
worst precision                 2.47e-005
Title: Re: Trigonometry ...
Post by: dedndave on March 27, 2015, 05:30:46 AM
T.S. = Taylor Series   :P

this machine has QueryPerformanceCounter
something not quite right, there   :redface:
maybe you do a MUL and disregard the high dword ? - something along those lines
Title: Re: Trigonometry ...
Post by: Antariy on March 27, 2015, 05:33:33 AM
Every machine has QueryPerformanceCounter, but the problem is some OSes return the processor frequency with the call to the QueryPerformanceFrequency, and some return not it but some other value, with smaller than TSC resolution.
Title: Re: Trigonometry ...
Post by: dedndave on March 27, 2015, 05:39:33 AM
if he "beats" QueryPerformanceCounter against QueryPerformanceFrequency, it should work correctly either way   :biggrin:
Title: Re: Trigonometry ...
Post by: Antariy on March 27, 2015, 05:57:08 AM
if he "beats" QueryPerformanceCounter against QueryPerformanceFrequency, it should work correctly either way   :biggrin:

That's right, just noted that because once has a faulty cycles counting measure due to Win7 provides not cpu freq but rather some fixed freq.
Title: Re: Trigonometry ...
Post by: Siekmanski on March 27, 2015, 06:13:35 AM
Quote
but - it only looks up one value at a time, right?

Yes, I never needed to calculate more than one at the time.
You need a lot of shuffles to do 4 or 8 at once.
And if so, I precalculate those.

Here's a new one with the useless "sub" removed in part3:
And routines using a Jump table. Don't now if it's faster, haven't tested it yet ( maybe tomorrow )

Edit: new version see Reply #42
Title: Re: Trigonometry ...
Post by: qWord on March 27, 2015, 06:43:19 AM
rrr314159,

the precision significantly decrease for negative values. Also, you did know that FISTTP is an SSE3 instruction?

Just for compare, below an SSE2 implementation for sin(x) (also TP).  Due to the needed truncation the domain is approx. -3.37e9 to +3.37e9. The relative error is below 10-11.
Code: [Select]
    OPTION PROLOGUE:none
    OPTION EPILOGUE:none
    align 16
    sinsd proc
       
        .const
            align 16
            ; head: pi/2  ( hx = Round(mantissa(pi/2)*2^23 )/2^23 )
            sinsd_cc0       REAL8 1.57080078125
            ; tail: pi/2  ( tx = pi /2 - hx )
            sinsd_cc1       REAl8 -4.45445510338076867830836E-6
           
            ; 2/pi
            sinsd_2_pi  REAL8  0.6366197723675813430755351  ; 2/pi
   
            sinsd_qdrnt REAl8  0.0
                        REAl8  1.5707963267948966192313217
                        REAl8  0.0
                        REAl8 -1.5707963267948966192313217
           
            sinsd_c     REAL8 -7.647163731819816475901132E-13
                        REAL8 1.6059043836821614599392377E-10
                        REAL8 -2.50521083854417187750521E-8
                        REAL8 2.75573192239858906525573E-6
                        REAL8 -0.19841269841269841269841E-3
                        REAL8 8.333333333333333333333333E-3
                        REAL8 -0.16666666666666666666667
                        REAL8 1.0
        .code
       
       
        ; k = Floor(x*2/Pi)
        ; xr = (x-C0*k) - C1*k
        ; REF: Jean-Michel Muller, "Elementary Functions, Algorithms and Implementation", Second edition
        movsd xmm3,xmm0
        mulsd xmm3,sinsd_2_pi
        cvttsd2si eax,xmm3
        cvtsi2sd xmm1,eax
        mov edx,eax
        and edx,3
        movsd xmm2,xmm1
        mulsd xmm1,sinsd_cc0
        mulsd xmm2,sinsd_cc1
        subsd xmm0,xmm1
        subsd xmm0,xmm2
       
        jz @calc
        movsd xmm1,xmm0
        movsd xmm0,sinsd_qdrnt[edx*REAL8]
   
        and eax,80000003h
        js @s
        jnp @np
        jp @p
    @s: test eax,2
        jz @p
    @np:    subsd xmm0,xmm1
        jmp @calc
    @p:     addsd xmm0,xmm1
    @calc:
   
        movsd xmm2,xmm0
        mulsd xmm2,xmm2
        movsd xmm3,sinsd_c[0*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[1*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[2*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[3*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[4*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[5*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[6*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[7*REAl8]
        mulsd xmm0,xmm3
   
        ret
   
    sinsd endp

regards
Title: Re: Trigonometry ...
Post by: daydreamer on March 27, 2015, 07:03:01 AM
I have old code that creates+rotates etc+x/z,y/z cheating with help of SSE reciprocal instructions, two Points at a time, with help of a sincossincos 256 entry LUT
if anone is interested I dig up code

Title: Re: Trigonometry ...
Post by: Siekmanski on March 27, 2015, 08:14:12 AM
Yeah, i am !!!
Title: Re: Trigonometry ...
Post by: Siekmanski on March 27, 2015, 12:09:51 PM
Here are all the routines and the timing results.
I have included a jump table routine.

for the sources see Reply #42
Title: Re: Trigonometry ...
Post by: MichaelW on March 27, 2015, 02:34:26 PM
You can partially compensate for the lower frequency of the performance counters, relative to the TSC, by synchronizing with the counter at the start of the timed period, effectively reducing the uncertainty from two counter cycles to one. Given the variable clock speed for recent processors, and assuming that there is no reliable method of forcing the processor to run at a fixed frequency, for real-time timings the performance counters are all we have to work with.
Title: Re: Trigonometry ...
Post by: rrr314159 on March 27, 2015, 09:17:17 PM
Hi all,

@qword, thanks for catching that, it's not supposed to be fisttp but fistp. Works the same for positive numbers but for negative it needs fistp with truncate (down to -inf) rounding mode, then the precision remains unchanged. I'll post a fix soon. I'll also take a look at your SSE algo ... the precision at 10^-11 is much better than I need, but more never hurts.

@dedndave, you're suggesting (I think) that you and nidud had considerably longer timings, so somewhere a multiplication that worked OK on my machine is overflowing into the high word on yours - which I'm neglecting to check. Makes sense, if so be easy to fix. I'll take a look ...

@MichaelW, rdtsc works fine on my machines, even with variable clock - I get the exact same numbers as with perf counter (but with 3 extra digits of course). Perhaps I need to check it for longer times, I rarely go above 10 seconds or so - or just look more carefully ;) It's often important to get nanosecond timing, so I can't abandon rdtsc. Therefore I've made two versions of timing routines, using each. For this project, as it turned out I could even use your timers.asm; as posted, I'm only timing one big loop. When I started however I was timing many short segments, doing lots of nanosecond calculations, and not looping; so needed my routines, and just left them in final version. If I get around to it I'll replace them.

@siekmanski, here's what I wrote b4 reading recent posts,

Your LUT technique is better than what I was doing in a couple of ways, checking for quadrants is simpler and faster, so I'll borrow that - thanks! [edit] - I see you've posted an "even better" jump table version, I'll take a look ...

When fed numbers in the right range it's about twice as fast as the Taylor Series routine at about the same precision. But that doesn't change the conclusion that, without vgather instruction, the T.S. should be better for SIMD. It should be able to do 4 numbers at once (or, 8 with AVX) and therefore still be twice (or more) as fast as the LUT - I hope. I'll let u know when finished. [edit] qword's posting may be just what I need ...

However you do have a couple bugs - which probably never happen the way you use the routine. First, gives wrong answer at exactly pi/2 or 3pi/2. This can be fixed two ways. First, use truncate mode instead of the usual rounding mode (probably that's what you do). With normal rounding mode, it will round up to the 16,385th entry. So you can also put one more entry at the top of the table to catch it - gives exact right answer, 1, and only costs 1 byte. Both these fixes are shown below, FWIW.

Second, it doesn't handle larger numbers right. At 10,000 the answers are less precise by almost two digits; at 100,000 loses another digit; and fails entirely somewhere between there and 1,000,000. Basic reason (I figure), you convert real number to integer before truncating down to the first quadrant. If you reduce to the first quadrant using FPU, and only then convert to integer, it avoids this problem; but that's a lot slower. I have to handle larger numbers all the time, so do it that way; probably you don't (or would have noticed this problem). Not sure best way to fix it (or even if u need to fix it), so I left it alone.

Here's the slightly modified LUT as I tested it:

Code: [Select]
; SSE Sin Cos routine by Siekmanski 2010.
include \masm32\include\masm32rt.inc
     .686p
     .xmm

; minor mods rrr314159 2015/03/25

FLT4 MACRO float_number:REQ
LOCAL float_num
    .data
     align 4
     float_num real4 float_number
    .code
    EXITM <float_num>
ENDM

.data?
align 16
sseSinTable real4 1 + 65536/4 dup (?)          ; rrr, add one more entry at top

temp4 real4 ?                                  ; rrr, for going from FPU to xmm

.data
align 16
NegSP dd 080000000h,000000000h,000000000h,000000000h

.code

align 4
InitSinTable proc public uses edi
LOCAL PI2:real4
LOCAL SinCounter:real4

    finit
    fclex
   
;    call initFPUsettruncatemode                ; rrr, set truncate mode or else add one more entry at top
                                                ; I added the entry because it's a little more accurate
fld FLT4(0.0)
fstp SinCounter
fldpi
fadd st(0),st(0)
fstp PI2
lea edi,sseSinTable
mov ecx,65536/4
sin_lp:
fld SinCounter
fdiv FLT4(65536.0)
fmul PI2
fsin
fstp real4 ptr[edi]
fld SinCounter
fadd FLT4(1.0)
fstp SinCounter
add edi,4
dec ecx
; jnz sin_lp
jge sin_lp                    ; rrr, jge instead of jnz to add entry at top
ret

InitSinTable endp


align 16
SSE_Sin proc ; in: xmm0, out: xmm0 with sin value

; modified rrr314159 2015/03/25 to use st(0) in and out of xmm0

fstp temp4
movss xmm0, temp4

mulss xmm0,FLT4(10430.3783504704527249495663163) ; 65536 / PI2
cvttss2si eax,xmm0

and eax,65535
cmp eax,16384-1
ja Part2
movss xmm0,real4 ptr [sseSinTable+eax*4]
jmp rrrret
Part2:
cmp eax,16384*2-1
ja Part3
mov edx,16384*2
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
jmp rrrret
Part3:
cmp eax,16384*3-1
ja Part4
sub eax,16384*2
movss xmm0,real4 ptr [sseSinTable+eax*4]
xorps xmm0,oword ptr NegSP
jmp rrrret
Part4:
mov edx,16384*4
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
xorps xmm0,oword ptr NegSP

rrrret:
    movss temp4, xmm0
    fld temp4
    ret

 SSE_Sin endp


align 16
SSE_Cos proc ; in: xmm0, out: xmm0 with cos value

; modified rrr314159 2015/03/25 to use st(0) in and out of xmm0

fstp temp4
movss xmm0, temp4

mulss xmm0,FLT4(10430.3783504704527249495663163) ; 65536 / PI2
cvttss2si eax,xmm0
add eax,16384
and eax,65535
cmp eax,16384-1
ja Part2
movss xmm0,real4 ptr [sseSinTable+eax*4]
jmp rrrret
Part2:
cmp eax,16384*2-1
ja Part3
mov edx,16384*2
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
jmp rrrret
Part3:
cmp eax,16384*3-1
ja Part4
sub eax,16384*2
movss xmm0,real4 ptr [sseSinTable+eax*4]
xorps xmm0,oword ptr NegSP
jmp rrrret
Part4:
mov edx,16384*4
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
xorps xmm0,oword ptr NegSP

rrrret:
    movss temp4, xmm0
    fld temp4
    ret

 SSE_Cos endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data?
    thecontrolword dw ?
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
initFPUsettruncatemode:
    finit
    fstcw thecontrolword
    fwait
    mov   ax,thecontrolword
    or    ax,0400h  ;this will set both bits of the RC field to the truncating mode
                    ; rounding down to -infinity as math INT function

    mov thecontrolword, ax
    fldcw thecontrolword     ;load the modified Control Word
ret

end

Here are a few test runs to illustrate the LUT's 2X higher speed, relative to T.S., and also the large-number problem:

Code: [Select]
WITH NUMBERS NEAR 0.0

    ----------------------------------------
FPU fsin nanos per iter         31.842
FPU fcos nanos per iter         31.906

Test Function: SIEKMANSKI========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.846
Speed ratio FPU/test fn         17.3

Precision
average precision               3.09e-005
worst precision                 9.25e-005 ; GOOD PRECISION

********** COS Using faster version **********
Nanoseconds per Iteration       1.822
Speed ratio FPU/test fn         17.5

Precision
average precision               3.09e-005
worst precision                 9.34e-005

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.562
Speed ratio FPU/test fn         8.95

Precision
average precision               9.98e-005
worst precision                 8.27e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.317
Speed ratio FPU/test fn         9.61

Precision
average precision               1.00e-004
worst precision                 8.94e-004

**************************************************************
WITH NUMBERS NEAR 10000.0

    ----------------------------------------
FPU fsin nanos per iter         31.873
FPU fcos nanos per iter         31.955

Test Function: SIEKMANSKI========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.808
Speed ratio FPU/test fn         17.7

Precision
average precision               5.46e-004
worst precision                 1.42e-003

********** COS Using faster version **********
Nanoseconds per Iteration       1.810
Speed ratio FPU/test fn         17.6

Precision
average precision               5.45e-004
worst precision                 1.36e-003 ; LOSING PRECISION

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.548
Speed ratio FPU/test fn         9

Precision
average precision               1.00e-004
worst precision                 8.92e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.555
Speed ratio FPU/test fn         8.98

Precision
average precision               9.98e-005
worst precision                 8.28e-004

**************************************************************
WITH NUMBERS NEAR 100000.0

    ----------------------------------------
FPU fsin nanos per iter         31.800
FPU fcos nanos per iter         31.895

Test Function: SIEKMANSKI========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.806
Speed ratio FPU/test fn         17.6

Precision
average precision               5.33e-003
worst precision                 1.42e-002 ; LOSING PRECISION

********** COS Using faster version **********
Nanoseconds per Iteration       1.739
Speed ratio FPU/test fn         18.3

Precision
average precision               5.16e-003
worst precision                 1.41e-002

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.518
Speed ratio FPU/test fn         9.05

Precision
average precision               9.99e-005
worst precision                 8.74e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.536
Speed ratio FPU/test fn         9.01

Precision
average precision               9.98e-005
worst precision                 8.46e-004

Title: Re: Trigonometry ...
Post by: dedndave on March 28, 2015, 12:21:32 AM
@dedndave, you're suggesting (I think) that you and nidud had considerably longer timings...

no - our older processors seem to be faster than all the new ones   :biggrin:
i took a quick look and noticed you are using the FPU for calculations
so, my theory about ignoring a high dword probably isn't the case
Title: Re: Trigonometry ...
Post by: daydreamer on March 28, 2015, 04:59:22 AM
I read that pixelshader is capable of have 1 cycle lookup of sin and cos,anyone capable of writing pixelshader that outputs LUT to system memory?
its earlier ddraw that doesnt work on newer OS,win xp tops ,actually should be wieved in 8bit color palette, sorry cant find the latest source for it
this is very early version I found
Code: [Select]
initcircle PROC
LOCAL @@stp :REAL4
LOCAL @@cnt :REAL4
    finit
    fldpi
    fidiv degrees
    fstp @@stp
    fldz
    fstp @@cnt
    mov eax,0
    mov ecx,[degrees]
    add ecx,ecx
    add ecx,ecx
@l1:fld @@cnt
    fsincos
    fstp REAL4 PTR [eax*8+sintab]
    fstp REAL4 PTR [eax*8+sintab+4]
    fld @@cnt
    fadd @@stp
    fstp @@cnt
    inc eax
    dec ecx
    jne @l1
   

   
ret
initcircle ENDP
circle2 PROC ;SSE version
    mov ecx,[degrees]
    add ecx,ecx
   
    lea edi,[buffer]
    MOVAPS XMM3,[xcentref]
    MOVAPS XMM2,[radius]
@l1:    MOVAPS XMM1,[ecx*8+sintab]
    MULPS XMM1,XMM2
    ADDPS XMM1,XMM3
    MOVAPS [tsse],XMM1
    ;CVTPS2PI MM1,XMM1
    ;MOVQ [tsse],MM1
    ;MOVHLPS XMM0,XMM1
    ;CVTPS2PI MM1,XMM0
    ;MOVQ [tsse+8],MM1
    fld tsse
    fistp tsse
    fld tsse+4
    fistp tsse+4
    fld tsse+8
    fistp tsse+8
    fld tsse+12
    fistp tsse+12
    mov edx,[tsse]
    sal edx,8 ;or whatever it needs to be scaled to
    add edx,[tsse+4]
    mov eax,0f0f0f0f0h
    mov [edi+edx],eax
    mov edx,[tsse+8]
    sal edx,8
    add edx,[tsse+12]
    mov [edi+edx],eax
       
    sub ecx,2
    jne @l1
    ret   
   

circle2 ENDP
here is my first concept of SSE rotation code
Code: [Select]
    Xt = X*COS(í) - Y*SIN(í)
    Yt = X*SIN(í) + Y*COS(í)
SSE:MOVAPS XMM0,X4
MOVAPS XMM2,Y4
MOVAPS XMM1,XMM0
MOVAPS XMM3,XMM2
MULPS XMM0,COSTAB
MULPS XMM1,SINTAB
MULPS XMM2,SINTAB
MULPS XMM3,COSTAB
SUBPS XMM0,XMM2
ADDPS XMM1,XMM3
;
MOVAPS XMM0,X4
MOVAPS XMM1,Y4

and here is my second initial code designed with dx3d in mind,thats why I choosed XYZWUV vertex format ,use of smaller format like xyzw vertex format and you can exchange all slower movups with movaps
Code: [Select]
movaps XMM0,XYZW
    movups XMM4,XYZW+24
    movaps XMM5,XYZW+48
    movups XMM6,XYZW+24*3
    movaps XMM7,XYZW+96
    movaps XMM1,xscale1cossin??? ;why not use a free scale of X?
    movaps XMM2,xscale2sincos???
    mulps XMM0,XMM1
    movaps ???ycosminuszsin???,XMM0
    mulps XMM4,XMM1
    movaps ???ycosminuszsin???+16,XMM4
    mulps XMM5,XMM1
    movaps XMM0,XYZW ;load again how many clock cycles after store?
    movaps ???ycosminuszsin???+32,XMM5
    mulps XMM6,XMM1
    movaps ???ycosminuszsin???+48,XMM6
    mulps XMM7,XMM1
    movaps ???ycosminuszsin???+64,XMM7
    mulps XMM2,XMM0
    movaps ???ycosminuszsin???,XMM0
    movaps ???ysinpluszcos???,XMM2
    ;movss/subss/movss and movss/addss/movss section
      ;after
    movss X,XMM1 
    movss XMM3,Ycos ;now if I input a reg for example eax to this proc for
    subss XMM3,zsin ;what change what axis I gonna rotate
    movss Y,XMM3
    movss XMM4,ysin
    addss XMM4,zcos
    movss Z,XMM4
data XYZWUV ;24byte w UV coordinates
data coscossindummy ;unused in rotate can be used to scale variable
data sinsincosdummy
data xcosycosminuszsindummy
data xcosycosminuszsindummy
   
Title: Re: Trigonometry ...
Post by: rrr314159 on March 28, 2015, 05:30:20 AM
Quote from: siekmanski
You need a lot of shuffles to do 4 or 8 at once.

My case is the opposite: The numbers are being handed to me already packed in a xmm or ymm register. So I'd have to do "a lot of shuffles" to use the LUT. But your routine is so fast, it almost looks worthwhile to unpack the data, apply the LUT one by one, and repack it!

These SIMD instructions always seem to be missing the one instruction u need to make it work (like, integer arithmetic, or vgather). And, they're often disappointingly slow. Nevertheless, they do give a big advantage in straight floating point math performed on data already lined up in memory.

Quote from: dedndave
i took a quick look and noticed you are using the FPU for calculations
so, my theory about ignoring a high dword probably isn't the case

That's right: I'm putting rdtsc output into a real8 right away, no "long integer". Better if you're doing a lot of calculations. Plus my routines can be stuck into "live" code very easily - that's why I prefer them. U know, it's worth noting that in another thread jj2007's "bible" routine wasn't working on your old machine ... so, at the moment I suspect it's something to do with that. (I'm hoping to goad u into figuring out the problem, to defend your computer's reputation :)

@Antariy, thanks for your timing numbers, T.S. routines are slow on those older machines, maybe I should just wait a few years until everybody replaces them with i7's like siekmanski's :)

@qword,

I understand you're not endorsing this SSE algo, just putting it out as an example, and as such it's very useful, gives me a baseline to compare to, so thanks! But I'll bet it can be done better!

First, he's got too many coefficients: the x^15 term must be meaningless, given real8 accuracy. Also (I think) he's mult'ing them the wrong way: starts with smallest terms, up to largest. That means the x^13 coeff (at 10^-10) gets added 6 times, mult'ed 7 times! - when u only need to mult it once, and add. That term will be lost in the noise.

[edit] Got the algo working and I was extremely wrong. Those two terms both add 2 digits precision - and cost 10% in time (very close). So the algo as written has about 12 digits precision, and is only 1.2 times faster than FPU; take off last term, get 10 digits, but now it's about 1.32 faster. Very interesting, thanks again qWord

Also his quadrant determination looks bad - u only need 2 comparisons - he's even got 3 branches one after the other, a no-no (according to Agner Fog). I have a trick that gets it down to only one comparison (altho I'm pretty sure the overhead makes it not worthwhile). (I think) there are other problems ... but let me work on it more, I could be wrong about some (or all) of these opinions, haven't actually tested them yet.

@daydreamer2, Yes I've heard the same thing about pixelshader and other GPGPU floating point units, would love to take advantage of it. ... thx for the code!
Title: Re: Trigonometry ...
Post by: Siekmanski on March 28, 2015, 08:01:53 PM
Hi, rrr314159

Quote
However you do have a couple bugs - which probably never happen the way you use the routine. First, gives wrong answer at exactly pi/2 or 3pi/2. This can be fixed two ways. First, use truncate mode instead of the usual rounding mode (probably that's what you do). With normal rounding mode, it will round up to the 16,385th entry. So you can also put one more entry at the top of the table to catch it - gives exact right answer, 1, and only costs 1 byte. Both these fixes are shown below, FWIW.

You are right, didn't noticed this rounding error.

Quote
Second, it doesn't handle larger numbers right. At 10,000 the answers are less precise by almost two digits; at 100,000 loses another digit; and fails entirely somewhere between there and 1,000,000. Basic reason (I figure), you convert real number to integer before truncating down to the first quadrant. If you reduce to the first quadrant using FPU, and only then convert to integer, it avoids this problem; but that's a lot slower. I have to handle larger numbers all the time, so do it that way; probably you don't (or would have noticed this problem). Not sure best way to fix it (or even if u need to fix it), so I left it alone.

Never used numbers larger than 2pi, but i'll have a look into this larger number issue.
Title: Re: Trigonometry ...
Post by: Siekmanski on March 29, 2015, 01:12:43 AM
Rounding mistake solved:

Code: [Select]
.data?
align 16
sseSinTable     real4 65536/4 dup (?)

.data
align 4
NegSP           dd 080000000h
QuadrantJump    dd offset Quadrant1,offset Quadrant2,offset Quadrant3,offset Quadrant4

.code

align 4
InitSinTable proc uses edi
LOCAL PI2:real4
LOCAL SinCounter:real4
   
    finit
    fclex

    fld     FLT4(0.5)
    fstp    SinCounter
    fldpi
    fadd    st(0),st(0)
    fstp    PI2
    lea     edi,sseSinTable
    mov     ecx,65536/4
sin_lp:
    fld     SinCounter
    fmul    PI2
    fdiv    FLT4(65536.0)   
    fsin
    fstp    real4 ptr[edi]
    fld     SinCounter
    fadd    FLT4(1.0)
    fstp    SinCounter
    add     edi,4
    dec     ecx
    jnz     sin_lp
    ret
InitSinTable endp

align 4
SSE_Sin proc
    mulss       xmm0,FLT4(10430.2191955273608296137974326)  ; 65535 / PI2
    cvttss2si   eax,xmm0
    and         eax,65535
    mov         edx,eax
    shr         edx,14
    jmp         [QuadrantJump+edx*4]
SSE_Sin endp

align 4
SSE_Cos proc
    mulss       xmm0,FLT4(10430.2191955273608296137974326)  ; 65535 / PI2
    cvttss2si   eax,xmm0
    add         eax,16384
    and         eax,65535
    mov         edx,eax
    shr         edx,14
    jmp         [QuadrantJump+edx*4]
SSE_Cos endp

align 4
Quadrant1:
    movss       xmm0,real4 ptr [sseSinTable+eax*4]
    ret
align 4
Quadrant2:
    mov         edx,16384*2
    sub         edx,eax
    movss       xmm0,real4 ptr [sseSinTable+edx*4]
    ret
align 4
Quadrant3:
    movss       xmm0,real4 ptr [sseSinTable-(16384*2*4)+eax*4]
    xorps       xmm0,real4 ptr NegSP
    ret
align 4
Quadrant4:
    mov         edx,16384*4
    sub         edx,eax
    movss       xmm0,real4 ptr [sseSinTable+edx*4]
    xorps       xmm0,real4 ptr NegSP
    ret

Next, attack larger number issue.
Title: Re: Trigonometry ...
Post by: rrr314159 on March 29, 2015, 04:24:01 AM
It's a challenge; one way or another you'll have to take a speed hit. When you multiply by 65535 /2pi the number easily becomes too large for real4. Normally one would mult by 1/2pi - or, actually, 2/pi, to determine the quadrant. Here's how Jean-Michel Muller does it (slightly edited),

Code: [Select]
mulsd xmm0,sinsd_2_pi    ; mul by 2/pi
        cvttsd2si eax,xmm3           ; convert to real8
        cvtsi2sd xmm1,eax            ; convert to integer
        mov edx,eax                     ; save integer in edx
        and edx,3                          ; and determine the quadrant (more or less)
        mulsd xmm1,sinsd_cc0       ; mult the integer back up by pi/2
        subsd xmm0,xmm1            ; subtract from original to get remainder, between 0 and pi/2

He has two advantages, 1) using real8 handles much larger numbers, 2) he doesn't really care about speed. This whole procedure costs him maybe 5%, but it would cost you 50%. So what you want is to do it with integers ... but as long as u mult by 65536/2pi and stick with real4, u can't go much over 1000 or so without losing precision.

Well, it's a challenge!
Title: Re: Trigonometry ...
Post by: Siekmanski on March 29, 2015, 08:54:20 AM
This is how i handled the larger numbers for a 16384 LUT,

Code: [Select]
SSE_Sin proc
    movss       xmm1,xmm0
    mulss       xmm1,FLT4(0.15915494309189533576888376337)  ; 1 / PI2
    cvttss2si   eax,xmm1   
    cvtsi2ss    xmm1, eax
    mulss       xmm1,FLT4(6.28318530717958647692528676656)  ; PI2
    subss       xmm0, xmm1
    mulss       xmm0,FLT4(10430.2191955273608296137974326)  ; 65535 / PI2
    cvttss2si   eax,xmm0
    and         eax,65535
    mov         edx,eax
    shr         edx,14
    jmp         [QuadrantJump+edx*4]
SSE_Sin endp

The timings:
Code: [Select]
FPU and SSE Sin Cos routine timings.

Please wait for 5 seconds......

192 cycles SSE_Sin

193 cycles SSE_Cos

Now the FPU sin and cos timings
This may take a while....

7130 cycles fsin
7130 cycles fcos

Press any key to continue...

Edit: new version see Reply #42
Title: Re: Trigonometry ...
Post by: jj2007 on March 29, 2015, 12:53:51 PM
This is how i handled the larger numbers for a 16384 LUT

A factor 37 is really impressing :t

Code: [Select]
164 cycles SSE_Sin
160 cycles SSE_Cos

Now the FPU sin and cos timings
This may take a while....

5937 cycles fsin
5950 cycles fcos

btw JWasm doesn't like xorps xmm0, real4 ptr NegSP - and it looks indeed a bit fishy.
ML 9.0 encodes it as
Code: [Select]
004011ED          ³.  0F5705 00204000         xorps xmm0, xmmword ptr [402000]       ; float 5.883850e-39, 5.883822e-39, 5.883806e-39, 0.0
Title: Re: Trigonometry ...
Post by: rrr314159 on March 29, 2015, 02:47:33 PM
jj2007,

In this case I believe your suspicions are misplaced :)

Those floats, 5.883850e-39 etc, are the addresses of the labels Quadrant1, Quadrant2 etc, which appear right after the NegSP variable in the proc's LOCAL statement. The float 0.0 is actually 80000000h, the "negative mask", and the statement correctly changes the sign bit of the first real4 in xmm0.

JWasm insists that we write "xmmword ptr" whereas ML helpfully fills it in - the opposite of the usual behavior of the two compilers. It's a bit dangerous, actually. Looks like the statement doesn't touch the upper 3 real4's in xmm0, but it does. Could cause hard-to-find bugs.

In this case, considering the usage of xmm0, everything's copacetic. But if the prog were changed later, could cause probs, so it's good to be aware of it
Title: Re: Trigonometry ...
Post by: jj2007 on March 29, 2015, 07:20:20 PM
Thanks, I had no time to dive into the code, and just saw JWasm's complain :redface:
Title: Re: Trigonometry ...
Post by: Siekmanski on March 29, 2015, 09:13:41 PM
Hi jj,

Quote
btw JWasm doesn't like xorps xmm0, real4 ptr NegSP - and it looks indeed a bit fishy.

Thought to save 12 bytes, but indeed it may be confusing.
Just change these lines,

align 16
NegSP           dd  080000000h,0,0,0

xorps       xmm0,oword ptr NegSP

Marinus
Title: Re: Trigonometry ...
Post by: Siekmanski on March 30, 2015, 09:05:38 AM
Changed the "fishy real4 ptr NegSP" to "non fishy and JWasm proof" in the source code.  :biggrin: :biggrin: :biggrin:
New attachment see Reply #31
Title: Re: Trigonometry ...
Post by: Gunther on March 30, 2015, 09:28:16 AM
Marinus,

the new timings from post #31 with i7-3770 CPU, 3.40 GHz running Windows 7-64:
Code: [Select]
FPU and SSE Sin Cos routine timings.

Please wait for 5 seconds......

171 cycles SSE_Sin

174 cycles SSE_Cos

Now the FPU sin and cos timings
This may take a while....

6262 cycles fsin
6262 cycles fcos

Press any key to continue...


Gunther
Title: Re: Trigonometry ...
Post by: Siekmanski on March 30, 2015, 09:49:50 AM
Thanks  :t
Title: Re: Trigonometry ...
Post by: Siekmanski on March 31, 2015, 10:06:15 AM
Reviewed the routines ( correction in the Cosine routine ) and think i've got it as precise as possible now for the real4 LUT approach.
Added a Tangent routine.

Hi, rrr
Just curious, did you find any more speed up tricks without using a LUT and or calculating more at once ?

Marinus

Edit: new version see Reply #42
Title: Re: Trigonometry ...
Post by: rrr314159 on March 31, 2015, 02:24:15 PM
Hi siekmanski,

glad u asked, I was just about to post this update ...

Here are my current versions of trig routines. There are 2 of them, Taylor Series and LUT.

T.S. - I used a technique from Jean-Michel Muller, from the SSE routine qWord donated, to improve trigS (the sin-T.S. routine.) It gets up to 11 significant digits; or, in reduced precision mode, gets 5-6 digits 9 times faster than FPU.

BTW I've made a SIMD version of that Muller algo (much modified), sped it up somewhat, but it's still awfully slow. Will have something to show one of these days, for now both these 2 routines are clearly superior, in spite of doing only one input at a time.

The previously posted trigC, based on cos T.S., is still pretty good as is, not re-posted here. This one is prob a bit better.

LUT - siekmanski, your latest version seems as precise as can be done, but I realized why I don't have these problems: I use FPU. Even with real4 data FPU does intermediate calc's to 80 bits so it never loses precision; all the way up to 10^38 gives the same results. So I'm posting my FPU LUT technique, called trigLUT. Using your 16384 entry table, thanks. I suppose it's slower than your lookup routine but for me the extra precision is important. Goes 25 times faster than FPU, not far from 1 nanosecond per number.

One reason for this thread was to find out "what I'm doing wrong". I always use "truncate down to -infinity" rounding mode, compatible with mathematics int function, but everybody else uses the default "round-to-nearest". Although that's much clumsier (to me) I guess it's just as good and has the advantage of being standard. So probably I should convert to that. However haven't done so yet, so these routines still truncate to -inf. A lot easier to understand, isn't it?

Also I should use the standard timers.asm, but again it's not very comfortable to me so I've still got my own routines here. Sorry ... I believe the numbers are reasonably accurate but their main purpose is just relative comparison

Here is the LUT routine:

Code: [Select]
;;*******************************
trigLUT MACRO sincosflag:=<0>   ;; rrr314159 2015/03/30 fpu-based lookup
;;*******************************
    fmul twooverpi              ;; div by pi/2 to put in 1st quadrant
    fld st
    fistp qword ptr [esp-8]     ;; math-like truncate, down towards -inf
    mov eax, dword ptr [esp-8]  ;; eax has (lower word of) int quotient x / pi/2
    fild qword ptr [esp-8]
    fsub                        ;; now mod 1 (0 to .999999) meaning 0-pi/2
    add eax, sincosflag         ;; this converts to cos when sincosflag = 1
    test eax, 1
    jz @F
        fld1
        fsubrp                  ;; replace w/ 1-edx for these quadrants
    @@:
   
    fmul FLT8(16384.0)
    fistp dword ptr [esp-4]     ;; get table index
    mov edx, dword ptr [esp-4]
    fld real4 ptr [sseSinTable+edx*4]

    and eax, 2
    je @F
        fchs                    ;; was in a negative quadrant
    @@:
ENDM
;;*******************************

trigS is longer and messier so please check the zip for that. Here is a typical precision / timing run:

Code: [Select]
FPU average nanos per iter      32.600

=======================================================================
Test Function: trigS ========================================
=======================================================================
********** SIN Using faster version **********
Nanoseconds per Iteration       3.589
Speed ratio FPU/test fn         9.08

Precision
average precision               1.58e-005
worst precision                 1.53e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.584
Speed ratio FPU/test fn         9.1

Precision
average precision               1.57e-005
worst precision                 1.48e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       7.389
Speed ratio FPU/test fn         4.41

Precision
average precision               2.93e-011
worst precision                 4.59e-011

********** COS Using higher precision **********
Nanoseconds per Iteration       7.407
Speed ratio FPU/test fn         4.4

Precision
average precision               2.93e-011
worst precision                 4.58e-011

=======================================================================
Test Function: trigLUT ========================================
=======================================================================
********** SIN Using higher precision **********
Nanoseconds per Iteration       1.295
Speed ratio FPU/test fn         25.2

Precision
average precision               1.52e-005
worst precision                 4.41e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       1.306
Speed ratio FPU/test fn         25

Precision
average precision               1.53e-005
worst precision                 4.77e-005

The zip contains

trig32.asm       ; the main routines (called 32 because there's also a 64-bit version, BTW)
test_support_macros ; lots of support macros, could stand some cleaning up
trig32.exe
doj32.bat       ; "do JWasm 32-bit" compiling batch file, just for reference

[edit] woops, forget to comment out the line "MY_CPU=1" at top of trig32.asm. Just comment that out. Unless you know your own CPU's FPU sin/cos average speed, in which case put it below at "MY_CPU_NANOS". This saves having to recompute it every time you run the prog.
Title: Re: Trigonometry ...
Post by: Gunther on March 31, 2015, 09:44:03 PM
Hi rrr,

here are the timings on my machine under Windows 7-64:
Code: [Select]
FPU average nanos per iter      32.600
=======================================================================
Test Function: trigS ========================================
=======================================================================

********** SIN Using faster version **********
Nanoseconds per Iteration       2.778
Speed ratio FPU/test fn         11.7

Precision
average precision               1.58e-005
worst precision                 1.53e-004

********** COS Using faster version **********
Nanoseconds per Iteration       2.789
Speed ratio FPU/test fn         11.7

Precision
average precision               1.57e-005
worst precision                 1.48e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       5.981
Speed ratio FPU/test fn         5.45

Precision
average precision               2.93e-011
worst precision                 4.59e-011

********** COS Using higher precision **********
Nanoseconds per Iteration       6.000
Speed ratio FPU/test fn         5.43

Precision
average precision               2.93e-011
worst precision                 4.58e-011
=======================================================================
Test Function: trigLUT ========================================
=======================================================================

********** SIN Using higher precision **********
Nanoseconds per Iteration       0.896
Speed ratio FPU/test fn         36.4

Precision
average precision               1.52e-005
worst precision                 4.41e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       0.926
Speed ratio FPU/test fn         35.2

Precision
average precision               1.53e-005
worst precision                 4.77e-005

Gunther
Title: Re: Trigonometry ...
Post by: Siekmanski on March 31, 2015, 10:11:42 PM
Improved the larger number issue.
Made a stupid miscalculation in the tangent routine.
New better version, now with single point and double point precision.
Title: Re: Trigonometry ...
Post by: Gunther on March 31, 2015, 10:43:18 PM
Hi Marinus,

thank you for uploading. By the way, here is another stupid false positive by Antivira. What the heck.

Gunther
Title: Re: Trigonometry ...
Post by: Siekmanski on March 31, 2015, 11:02:19 PM
Hi, Gunther
You mean the zip from Reply #42 ?
Title: Re: Trigonometry ...
Post by: Siekmanski on April 01, 2015, 01:52:21 AM
Here are the timings of the routines.

Code: [Select]
SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
Sin DP:  0.38260039 ( SSE2 Double point )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
Cos DP:  0.92384972 ( SSE2 Double point )

ftan:    0.41421356 ( FPU )
Tan SP:  0.41410828 ( SSE2 Single point )
Tan DP:  0.41410826 ( SSE2 Double point )

Start timing, please wait for 5 seconds......

183  cycles SSE2_SinSP
186  cycles SSE2_SinDP
7163 cycles fsin (FPU)

194  cycles SSE2_CosSP
197  cycles SSE2_CosDP
7163 cycles fcos (FPU)

191  cycles SSE2_TanSP
191  cycles SSE2_TanDP
7385 cycles ftan (FPU)
Title: Re: Trigonometry ...
Post by: Gunther on April 01, 2015, 02:56:02 AM
Marinus,

You mean the zip from Reply #42 ?

Yes the EXE from the post #42.

Gunther
Title: Re: Trigonometry ...
Post by: Siekmanski on April 01, 2015, 05:13:51 AM
nothing to kill in that zip  :biggrin:
Title: Re: Trigonometry ...
Post by: rrr314159 on April 01, 2015, 09:37:50 AM
@gunther, thanks for running my test, I'm afraid the timing numbers are not too useful, being nonstandard. That's why I decided to borrow siekmanski's timer to get a valid comparison,

hello siekmanski,

I hate to say this ...

I used your timing routine to check out my FPU version of your LUT, and it comes out almost twice as fast, plus an extra 1/2 digit accuracy. I also checked my Taylor Series routine, which is also faster, with an extra digit (at least) precision! Really surprised me - seems this thread was a success. The Taylor Series takes less than twice the FPU LUT time, much better than I expected.

Note, my routines are Macros while yours are procs, so that saves a few cycles by not making the CALL.

I didn't set the rounding mode to truncate, so as not to disturb the test environment. My routines are same for positive numbers (as used here), but would be off for negative. I did check them with that mode, everything comes out exactly the same, since your routines also are unaffected for pos numbers. The truncate routine is included, but the call is commented out.

It makes sense that FPU would be faster, and more accurate, for various reasons, but the bottom line seems to be (as I've mentioned many times) SSE is disappointing. Probably should only be used when SIMD capability gives a real advantage.

The above applies to Intel. For AMD the FPU LUT is still almost twice as fast as SSE LUT, but the rest of the story is different. AMD fsin and fcos are almost 10 times faster than Intel! But the Taylor Series comes out a bit slower than SSE LUT.

Here's my precision/timing runs:
Code: [Select]

Intel i5 3330 3 Ghz

SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

174  cycles SSE2_SinSP
77  cycles trigLUT Sin
144  cycles trigS Sin
6714 cycles fsin (FPU)

181  cycles SSE2_CosSP
91  cycles trigLUT Cos
160  cycles trigS Cos
6697 cycles fcos (FPU)

Press any key to continue...

*********************************************
AMD A6 1.8 Ghz

SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

214  cycles SSE2_SinSP
110  cycles trigLUT Sin
265  cycles trigS Sin
874 cycles fsin (FPU)

221  cycles SSE2_CosSP
131  cycles trigLUT Cos
287  cycles trigS Cos
906 cycles fcos (FPU)

Press any key to continue...

The zip contains your prog, slightly modified to include my trigLUT routine, also stripped out all Tan and Double Precision.
Title: Re: Trigonometry ...
Post by: Gunther on April 01, 2015, 09:51:14 AM
Marinus,

here are the new timings:
Code: [Select]
SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
Sin DP:  0.38260039 ( SSE2 Double point )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
Cos DP:  0.92384972 ( SSE2 Double point )

ftan:    0.41421356 ( FPU )
Tan SP:  0.41410828 ( SSE2 Single point )
Tan DP:  0.41410826 ( SSE2 Double point )

Start timing, please wait for 5 seconds......

167  cycles SSE2_SinSP
168  cycles SSE2_SinDP
6279 cycles fsin (FPU)

170  cycles SSE2_CosSP
173  cycles SSE2_CosDP
6280 cycles fcos (FPU)

168  cycles SSE2_TanSP
168  cycles SSE2_TanDP
6468 cycles ftan (FPU)


Press any key to continue...

and the following are for rrr:
Code: [Select]
SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

168  cycles SSE2_SinSP
75  cycles trigLUT Sin
138  cycles trigS Sin
6390 cycles fsin (FPU)

174  cycles SSE2_CosSP
88  cycles trigLUT Cos
154  cycles trigS Cos
6395 cycles fcos (FPU)

Press any key to continue...

Gunther
Title: Re: Trigonometry ...
Post by: Siekmanski on April 01, 2015, 10:08:37 AM
Thanks Gunther.

Wow ! rrr314159, this is really impressive !  :dazzled:
I'll study your source code tomorrow,  this thread is indeed a success.  :t

Code: [Select]
SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

184  cycles SSE2_SinSP
76  cycles trigLUT Sin
154  cycles trigS Sin
7130 cycles fsin (FPU)

193  cycles SSE2_CosSP
97  cycles trigLUT Cos
177  cycles trigS Cos
7138 cycles fcos (FPU)


Press any key to continue...
Title: Re: Trigonometry ...
Post by: dedndave on April 01, 2015, 12:50:10 PM
you guys are making me feel old   :lol:

P4 prescott w/htt @ 3.0 GHz
Code: [Select]
SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

580  cycles SSE2_SinSP
404  cycles trigLUT Sin
621  cycles trigS Sin
22774 cycles fsin (FPU)

673  cycles SSE2_CosSP
431  cycles trigLUT Cos
668  cycles trigS Cos
22701 cycles fcos (FPU)

very nice work, guys   :t
Title: Re: Trigonometry ...
Post by: Gunther on April 01, 2015, 01:07:13 PM
very nice work, guys   :t

Yes, indeed.  :t My deference.

Gunther
Title: Re: Trigonometry ...
Post by: nidud on April 02, 2015, 12:21:49 AM
Strange, a tread with positive improvement based on testing  :lol:

A few notes

As rrr pointed out it may be a good idea to either convert all algos to macros, or (maybe better) convert all to proc's.

Marinus, my CPU don’t like this one:
Code: [Select]
QuadrantJumpSP  dd offset Quadrant1SP,offset Quadrant2SP,offset Quadrant3SP,offset Quadrant4SP

I tried these canges
Code: [Select]
    shr edx,14
;    jmp [QuadrantJumpSP+edx*4]
jnz Quadrant2SP
    movss xmm0,real4 ptr [SSE2SinTableSP+eax*4]
    ret
SSE2_SinSP endp
...
    shr edx,14
;    jmp [QuadrantJumpSP+edx*4]
jnz Quadrant2SP
SSE2_CosSP endp
align 4
Quadrant1SP:
    movss xmm0,real4 ptr [SSE2SinTableSP+eax*4]
    ret
align 4
Quadrant2SP:
cmp edx,2
je Quadrant3SP
ja Quadrant4SP

    mov edx,16384*2

Test results before and after changes
Code: [Select]
fsin: 0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos: 0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

464  cycles SSE2_SinSP
486  cycles trigLUT Sin
648  cycles trigS Sin
887 cycles fsin (FPU)

484  cycles SSE2_CosSP
536  cycles trigLUT Cos
691  cycles trigS Cos
907 cycles fcos (FPU)
------------------
fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

434  cycles SSE2_SinSP
484  cycles trigLUT Sin
652  cycles trigS Sin
887 cycles fsin (FPU)

464  cycles SSE2_CosSP
536  cycles trigLUT Cos
691  cycles trigS Cos
907 cycles fcos (FPU)

The changes (or improvement) in this case may be CPU dependent, and in this case tested on AMD, so this must be tested, but it may save a few cycles on these algos.
Code: [Select]
464  cycles SSE2_SinSP
484  cycles SSE2_CosSP
...
434  cycles SSE2_SinSP
464  cycles SSE2_CosSP
Title: Re: Trigonometry ...
Post by: rrr314159 on April 02, 2015, 12:52:46 AM
Looks like it would improve performance on other computers, will give it a try in a while, but some AMD's are strange. Notice the FPU time is almost same as the other routines, but on Intel it's 40 times greater! My AMD A6 is similar. But some AMD's, like sinsi's (I think more modern / powerful) behave like Intel
Title: Re: Trigonometry ...
Post by: jj2007 on April 02, 2015, 02:52:38 AM
Just found this by accident here (http://stackoverflow.com/questions/18662261/fastest-implementation-of-sine-cosine-and-square-root-in-c-doesnt-need-to-b):

Quote
First, the Taylor series is NOT the best/fastest way to implement sine/cos. It is also not the way professional libraries implement these trigonometric functions
...
Check gsl gnu scientific library (or numerical recipes) implementation, and you will see that they basically use Chebyshev Approximation Formula. Chebyshev approximation converges faster, so you need to evaluate fewer terms.
Title: Re: Trigonometry ...
Post by: rrr314159 on April 02, 2015, 06:45:51 AM
@Gunther, dedndave, thanks for the congrats but we're not done yet ... main goal was fast SSE power Series routine. It's coming along, hope to post something soon ...

@jj2007, Chebyshev might be good idea. If you look at the OP you'll see I required a range of precisions and figured Chebyshev was n.g. because the coeff's change depending on the desired precision (unless u want to calc the orthogonal poly's which takes 2 long). But now that u mention it, we can just have entirely different series for different precisions - in the old days we couldn't afford to waste the space. Also remembered the gain as minimal, and the region of convergence problematical, but according to the post u reference it's substantial over our region of interest. Well, easy enuff to check it out - change 3 or 4 constants. I'll let u know how it works out. Thanks !

Doesn't anyone else think it strange that some AMD's have fcos and fsin that's 10 times faster (in cycles) than Intel? Why is that? Perhaps, they use a lot of transistors, Intel didn't think it worthwhile ...? If they were all like nidud's (and my) AMD would prob be better to just use built-in functions.
Title: Re: Trigonometry ...
Post by: rrr314159 on April 02, 2015, 07:17:17 AM
Took a minute to find out - yes, Chebyshev is worth it, especially for medium precision (6-7 digits), it gets at least 1 extra digit of precision. Takes a little longer for currently unknown reason. For higher precisions it's more trouble, zone of convergence shrinks rapidly. So for adjustable-precision algo it's a bit of a PITA, but, worth it. Thanks jj! Aren't u glad you're not wasting your time, in this thread? In math answers are exact and unarguable (for the most part), which makes it much more satisfying than text, where you have to deal with caches, bibles, opinions and who knows what. Now find a way to improve the LUT :biggrin:

I'll be posting an update with these better coefficients when I get the SSE SIMD algo working which should be soon

Quote from: nidud
As rrr pointed out it may be a good idea to either convert all algos to macros, or (maybe better) convert all to proc's.

- I converted them all to macro's because it's faster, admittedly proc's are more portable ... well, at least now we're comparing apples to apples
Title: Re: Trigonometry ...
Post by: Siekmanski on April 02, 2015, 10:31:27 PM
Hi rrr314159,
Studied your mega fast LUT routine.
Maybe you noticed already that your LUT routine crashes sometimes or gives wrong results.

Hi nidud,
The reason i used a jump table was to leverage out the time differences between the 4 quadrant calculations. But with only 4 entries it's not paying off, thus removed the jump table.

Improved the routine a little bit. It's more accurate now.

Code: [Select]
SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    -0.93832556 ( FPU )
Sin SP:  -0.93831056 ( SSE2 Single point )
rrrLUT:  -0.93834370 ( trigLUT Sin )
Taylor:  -0.93830954 ( trigS Sin )

fcos:    0.34575302 ( FPU )
Cos SP:  0.34572631 ( SSE2 Single point )
rrrLUT:  0.34572631 ( trigLUT Cos )
Taylor:  0.34575302 ( trigS Cos )

Start timing, please wait for 5 seconds......

174  cycles SSE2_SinSP
97  cycles trigLUT Sin
186  cycles trigS Sin
7134 cycles fsin (FPU)

180  cycles SSE2_CosSP
86  cycles trigLUT Cos
170  cycles trigS Cos
7133 cycles fcos (FPU)

Title: Re: Trigonometry ...
Post by: rrr314159 on April 03, 2015, 03:02:28 AM
Quote from: siekmanski
Maybe you noticed already that your LUT routine crashes sometimes or gives wrong results.

... It's the problem I've mentioned often, all my routines use "truncate down to -inf" rounding mode, but everybody else uses the default rounding mode, and that's what you're doing here. So the Taylor Series should also give wrong results for negative numbers, some other borderline cases too, where it rounds in the wrong direction. But it doesn't crash because it's not accessing mem incorrectly. The LUT can try to access table entry "16385", which is why an extra entry at the top (1.0) is safer. If you comment IN the call to initFPUsettruncatemode it may fix it entirely. I'll check it out.

Been putting off this rounding mode issue, but have to deal with it. "Math-style" rounding is better (at least one less instruction needed to determine quadrant, and more intuitive) but it's non-standard and I have to bite the bullet and convert to the default mode. No big deal, but hate to do it since all my algos (thousands of lines) use math int function. I intend to continue using it for my own work (unless I find out it's not, as I believe, better), so I'll need 2 versions forever ... a PITA. But can't expect the rest of the world to conform to me!

Meanwhile I've made a new test bed which feeds the algos SSE registers instead of FPU (for SIMD algo) and, naturally, your LUT does better there since it needs no conversion, as it does here. Still not as fast on Intel but on my AMD it's about the same. ... And the FPU sin and cos are 10X faster than Intel! All these routines would be more or less redundant, if everyone had AMD's like mine and nidud's.
Title: Re: Trigonometry ...
Post by: Siekmanski on April 03, 2015, 10:00:24 AM
... It's the problem I've mentioned often, all my routines use "truncate down to -inf" rounding mode, but everybody else uses the default rounding mode, and that's what you're doing here. So the Taylor Series should also give wrong results for negative numbers, some other borderline cases too, where it rounds in the wrong direction.

Didn't get it at first, but now i do. --> ("truncate down to -inf" rounding mode) I'm just a simple dutch guy you know.  :biggrin:
Solved "wrong results for negative numbers" for the the default rounding mode in my previous post.
Title: Re: Trigonometry ...
Post by: Siekmanski on April 03, 2015, 10:59:49 AM
Did a test with this fpu routine, it's slightly faster than the SSE2 routine on my PC.

Code: [Select]
FPUsin_Lut proc
    fld     st(0)
    fmul    FLT8(0.15915494309189533576888376337)
    fistp   real8 ptr [esp-8]
    mov     eax,dword ptr [esp-8]
    fild    real8 ptr [esp-8]
    fmul    FLT8(6.28318530717958647692528676656)
    fsub
    fmul    FLT8(10430.2191955273608296137974326)
    fistp   dword ptr [esp-4]
    mov     eax,dword ptr [esp-4]
    cmp     eax,-1
    jg      greater
    sub     eax,1   
greater:
    and     eax,65535
    mov     edx,eax
    shr     edx,14
    jnz     Q2
    fld     real4 ptr [SSE2SinTableSP+eax*4]
    ret
Q2: cmp     edx,2
    je      Q3
    ja      Q4
    mov     edx,16384*2
    sub     edx,eax
    fld     real4 ptr [SSE2SinTableSP+edx*4-4] 
    ret
Q3: fld     real4 ptr [SSE2SinTableSP-(16384*2*4)+eax*4]
    fchs
    ret
Q4: mov     edx,16384*4
    sub     edx,eax
    fld     real4 ptr [SSE2SinTableSP+edx*4-4]
    fchs
    ret
FPUsin_Lut endp
 

EDIT: Noticed that it is much faster in quadrant 1 and quadrant 2, didn't know "fchs" is such an expensive one. ( see if we can do without it...... )
Title: Re: Trigonometry ...
Post by: rrr314159 on April 03, 2015, 11:31:00 AM
While I was typing your last post appeared, I'll take a look, re. previous post,

...("truncate down to -inf" rounding mode)...

- I guess the official name for it is "floor", and "truncate" usually means "truncate towards 0" and ceiling, "truncate towards +inf". But there's some confusion, in my school days "int" used to mean either floor or truncate depending on the mood ;) and "floor" and "ceiling" were never heard ... today with the influence of computers I think both those terms are accepted, and int is left with the meaning "towards 0". But we never called any of these "rounding" (which is "nearest integer"), much less "rounding modes". This potential confusion is why I always spell it out, "truncate towards ... -inf, zero, +inf, or nearest integer".

I'm just a simple dutch guy you know.

- Yeh right - Gerard 't Hooft is (or, was?) Dutch! Maybe Gunther understands renormalizing spontaneously broken non-abelian gauge theories but not many do, and t'Hooft invented it age 19 (or whatever). Not to mention Huygens, Kuiper, Oort ... and a personal favorite of mine, not such a big name, Bart Bok, the only guy who understood the Milky Way - at least, the only one who could communicate that understanding. You need a better excuse!

Not that u need an excuse, I'm still baffled by 3d d3d9 ... intend to get around to it someday, thanks for that library
Title: Re: Trigonometry ...
Post by: jj2007 on April 03, 2015, 11:32:54 AM
didn't know "fchs" is such an expensive one.

Shouldn't be more than 2 cycles:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2023    cycles for 1000 * fchs
2953    cycles for 1000 * 0-x

2022    cycles for 1000 * fchs
2952    cycles for 1000 * 0-x

2023    cycles for 1000 * fchs
2958    cycles for 1000 * 0-x

2       bytes for fchs
4       bytes for 0-x
Title: Re: Trigonometry ...
Post by: Siekmanski on April 03, 2015, 12:21:33 PM
Hi rrr314159,
Quote
I'm still baffled by 3d d3d9 ... intend to get around to it someday, thanks for that library.

You're welcome, we could use some fast trigonometry functions for that d3d lib.

Hi Jochen,

You are right.
Did replace fchs with this piece of code and it didn't speed things up.

fstp  real4 ptr [esp-4]
xor   real4 ptr [esp-4],80000000h
fld    real4 ptr [esp-4]

   
 
Title: Re: Trigonometry ...
Post by: rrr314159 on April 03, 2015, 01:52:41 PM
Quote from: siekmanski
Did a test with this fpu routine, it's slightly faster than the SSE2 routine on my PC

- That's what I'm finding also. Isn't the slight advantage only because your inputs are in FPU so for SSE, must convert them? If inputs are originally prepared in SSE register, I think the advantage then goes the other way.

And, changing sign is faster for SSE, because can use the xor 80000000h trick without converting, as you must with FPU. So maybe SSE is the better way to go (altho FPU still has that 80-bit intermediate calculation advantage)
Title: Re: Trigonometry ...
Post by: jj2007 on April 03, 2015, 06:09:12 PM
Did replace fchs with this piece of code and it didn't speed things up.

Same for neg:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2034    cycles for 1000 * fchs
2979    cycles for 1000 * 0-x
13219   cycles for 1000 * xor
13187   cycles for 1000 * neg

2036    cycles for 1000 * fchs
2968    cycles for 1000 * 0-x
13214   cycles for 1000 * xor
13211   cycles for 1000 * neg

So it's not fchs that slows things down. Try "hiding" it between non-FPU instructions.
Title: Re: Trigonometry ...
Post by: Siekmanski on April 03, 2015, 09:32:27 PM
Hi Jochen,
Quote
Try "hiding" it between non-FPU instructions.

No place to hide, it's at the very end of the routine.

Here are the timings for all routines.
Measuring the minimum and maximum time each routine takes. ( quadrant 1 and quadrant 4 )
One minor correction in the code. Changed cvttsd2si --> cvtsd2si (left it whilst testing.)

Conclusion: The fpu is faster than SSE ( at least on my PC )

I'm very curious of the results from you guys....

Code: [Select]
SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

176  cycles SSE2sin Quadrant 1
178  cycles SSE2sin Quadrant 4

131  cycles FPUsin Quadrant 1
176  cycles FPUsin Quadrant 4

159  cycles FPUsin from SSE to FPU to SSE Quadrant 1
215  cycles FPUsin from SSE to FPU to SSE Quadrant 4

187  cycles SSE2cos Quadrant 1
193  cycles SSE2cos Quadrant 4

147  cycles FPUcos Quadrant 1
185  cycles FPUcos Quadrant 4

178  cycles FPUcos from SSE to FPU to SSE Quadrant 1
216  cycles FPUcos from SSE to FPU to SSE Quadrant 4

183  cycles SSE2tan Quadrant 1
184  cycles SSE2tan Quadrant 2

132  cycles FPUtan Quadrant 1
148  cycles FPUtan Quadrant 2

157  cycles FPUtan from SSE to FPU to SSE Quadrant 1
181  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

7130 cycles fsin (FPU)
7130 cycles fcos (FPU)
7356 cycles ftan (FPU)

Press any key to continue...
Title: Re: Trigonometry ...
Post by: Gunther on April 03, 2015, 09:46:55 PM
Hi Marinus,

I'm very curious of the results from you guys....

here they are:
Code: [Select]
SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

162  cycles SSE2sin Quadrant 1
170  cycles SSE2sin Quadrant 4

117  cycles FPUsin Quadrant 1
155  cycles FPUsin Quadrant 4

139  cycles FPUsin from SSE to FPU to SSE Quadrant 1
185  cycles FPUsin from SSE to FPU to SSE Quadrant 4

165  cycles SSE2cos Quadrant 1
167  cycles SSE2cos Quadrant 4

130  cycles FPUcos Quadrant 1
163  cycles FPUcos Quadrant 4

155  cycles FPUcos from SSE to FPU to SSE Quadrant 1
191  cycles FPUcos from SSE to FPU to SSE Quadrant 4

163  cycles SSE2tan Quadrant 1
162  cycles SSE2tan Quadrant 2

117  cycles FPUtan Quadrant 1
130  cycles FPUtan Quadrant 2

138  cycles FPUtan from SSE to FPU to SSE Quadrant 1
157  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

6271 cycles fsin (FPU)
6272 cycles fcos (FPU)
6469 cycles ftan (FPU)

Press any key to continue...

Gunther
Title: Re: Trigonometry ...
Post by: jj2007 on April 03, 2015, 09:47:58 PM
Core i5:
Code: [Select]
SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

146  cycles SSE2sin Quadrant 1
165  cycles SSE2sin Quadrant 4

117  cycles FPUsin Quadrant 1
154  cycles FPUsin Quadrant 4

139  cycles FPUsin from SSE to FPU to SSE Quadrant 1
181  cycles FPUsin from SSE to FPU to SSE Quadrant 4

151  cycles SSE2cos Quadrant 1
169  cycles SSE2cos Quadrant 4

132  cycles FPUcos Quadrant 1
160  cycles FPUcos Quadrant 4

157  cycles FPUcos from SSE to FPU to SSE Quadrant 1
185  cycles FPUcos from SSE to FPU to SSE Quadrant 4

152  cycles SSE2tan Quadrant 1
163  cycles SSE2tan Quadrant 2

116  cycles FPUtan Quadrant 1
131  cycles FPUtan Quadrant 2

139  cycles FPUtan from SSE to FPU to SSE Quadrant 1
158  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

5925 cycles fsin (FPU)
5921 cycles fcos (FPU)
6106 cycles ftan (FPU)
Title: Re: Trigonometry ...
Post by: Siekmanski on April 03, 2015, 09:51:25 PM
Thank you Gunther and Jochen.  :t
Title: Re: Trigonometry ...
Post by: rrr314159 on April 03, 2015, 10:07:15 PM
Similar to everyone else except fsin, fcos, ftan. AMD's ftan more than 10X faster than Intel!

Code: [Select]
AMD A6 1.8 Ghz

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

215  cycles SSE2sin Quadrant 1
258  cycles SSE2sin Quadrant 4

163  cycles FPUsin Quadrant 1
224  cycles FPUsin Quadrant 4

187  cycles FPUsin from SSE to FPU to SSE Quadrant 1
261  cycles FPUsin from SSE to FPU to SSE Quadrant 4

218  cycles SSE2cos Quadrant 1
257  cycles SSE2cos Quadrant 4

162  cycles FPUcos Quadrant 1
238  cycles FPUcos Quadrant 4

202  cycles FPUcos from SSE to FPU to SSE Quadrant 1
269  cycles FPUcos from SSE to FPU to SSE Quadrant 4

218  cycles SSE2tan Quadrant 1
229  cycles SSE2tan Quadrant 2

162  cycles FPUtan Quadrant 1
218  cycles FPUtan Quadrant 2

189  cycles FPUtan from SSE to FPU to SSE Quadrant 1
239  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

874 cycles fsin (FPU)
907 cycles fcos (FPU)
560 cycles ftan (FPU)

Press any key to continue...
Title: Re: Trigonometry ...
Post by: Siekmanski on April 03, 2015, 10:26:39 PM
Hi, rrr314159

Funny that ftan is faster than fsin and fcos on your machine.
Or it must be that fsincos is that fast, because i calculated ftan with fsincos and fdivp.
Title: Re: Trigonometry ...
Post by: dedndave on April 03, 2015, 11:35:05 PM
P4 prescott w/htt
Code: [Select]
SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

591  cycles SSE2sin Quadrant 1
577  cycles SSE2sin Quadrant 4

387  cycles FPUsin Quadrant 1
596  cycles FPUsin Quadrant 4

624  cycles FPUsin from SSE to FPU to SSE Quadrant 1
705  cycles FPUsin from SSE to FPU to SSE Quadrant 4

549  cycles SSE2cos Quadrant 1
931  cycles SSE2cos Quadrant 4

543  cycles FPUcos Quadrant 1
604  cycles FPUcos Quadrant 4

662  cycles FPUcos from SSE to FPU to SSE Quadrant 1
719  cycles FPUcos from SSE to FPU to SSE Quadrant 4

592  cycles SSE2tan Quadrant 1
604  cycles SSE2tan Quadrant 2

387  cycles FPUtan Quadrant 1
536  cycles FPUtan Quadrant 2

623  cycles FPUtan from SSE to FPU to SSE Quadrant 1
662  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

20852 cycles fsin (FPU)
20837 cycles fcos (FPU)
20332 cycles ftan (FPU)
Title: Re: Trigonometry ...
Post by: Siekmanski on April 03, 2015, 11:48:26 PM
Thanks Dave  :t
Title: Re: Trigonometry ...
Post by: FORTRANS on April 04, 2015, 02:58:24 AM
Hi,

   Three sets of results.  Given that these are all Intel, the FPU
results look odd.

3210 cycles fsin (FPU)
7194 cycles fsin (FPU)
17200 cycles fsin (FPU)

  Oh well, good to find out I suppose.

Code: [Select]
(Old Laptop)

Vendor String: GenuineIntel
Brand  String: Intel(R)Pentium(R)Mprocessor1.70GHz

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

470  cycles SSE2sin Quadrant 1
512  cycles SSE2sin Quadrant 4

551  cycles FPUsin Quadrant 1
583  cycles FPUsin Quadrant 4

623  cycles FPUsin from SSE to FPU to SSE Quadrant 1
642  cycles FPUsin from SSE to FPU to SSE Quadrant 4

491  cycles SSE2cos Quadrant 1
524  cycles SSE2cos Quadrant 4

574  cycles FPUcos Quadrant 1
594  cycles FPUcos Quadrant 4

636  cycles FPUcos from SSE to FPU to SSE Quadrant 1
657  cycles FPUcos from SSE to FPU to SSE Quadrant 4

471  cycles SSE2tan Quadrant 1
511  cycles SSE2tan Quadrant 2

552  cycles FPUtan Quadrant 1
553  cycles FPUtan Quadrant 2

623  cycles FPUtan from SSE to FPU to SSE Quadrant 1
616  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

3210 cycles fsin (FPU)
3202 cycles fcos (FPU)
3187 cycles ftan (FPU)

Press any key to continue...

==================
(Newer Laptop)

Vendor String: GenuineIntel
Brand  String: Intel(R)Core(TM)i3-4005UCPU@1.70GHz

16-bit floating-SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

185  cycles SSE2sin Quadrant 1
210  cycles SSE2sin Quadrant 4

178  cycles FPUsin Quadrant 1
238  cycles FPUsin Quadrant 4

148  cycles FPUsin from SSE to FPU to SSE Quadrant 1
192  cycles FPUsin from SSE to FPU to SSE Quadrant 4

185  cycles SSE2cos Quadrant 1
182  cycles SSE2cos Quadrant 4

134  cycles FPUcos Quadrant 1
167  cycles FPUcos Quadrant 4

161  cycles FPUcos from SSE to FPU to SSE Quadrant 1
205  cycles FPUcos from SSE to FPU to SSE Quadrant 4

185  cycles SSE2tan Quadrant 1
185  cycles SSE2tan Quadrant 2

117  cycles FPUtan Quadrant 1
136  cycles FPUtan Quadrant 2

142  cycles FPUtan from SSE to FPU to SSE Quadrant 1
167  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

7194 cycles fsin (FPU)
7209 cycles fcos (FPU)
7413 cycles ftan (FPU)

Press any key to continue...

                ==================
(Loaner?)
Vendor String: GenuineIntel
Brand  String: Intel(R)Pentium(R)4CPU2.40GHz

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

418  cycles SSE2sin Quadrant 1
433  cycles SSE2sin Quadrant 4

511  cycles FPUsin Quadrant 1
681  cycles FPUsin Quadrant 4

632  cycles FPUsin from SSE to FPU to SSE Quadrant 1
801  cycles FPUsin from SSE to FPU to SSE Quadrant 4

420  cycles SSE2cos Quadrant 1
713  cycles SSE2cos Quadrant 4

576  cycles FPUcos Quadrant 1
691  cycles FPUcos Quadrant 4

693  cycles FPUcos from SSE to FPU to SSE Quadrant 1
821  cycles FPUcos from SSE to FPU to SSE Quadrant 4

420  cycles SSE2tan Quadrant 1
421  cycles SSE2tan Quadrant 2

514  cycles FPUtan Quadrant 1
552  cycles FPUtan Quadrant 2

637  cycles FPUtan from SSE to FPU to SSE Quadrant 1
670  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

17200 cycles fsin (FPU)
17116 cycles fcos (FPU)
17985 cycles ftan (FPU)

Press any key to continue...

HTH,

Steve N.
Title: Re: Trigonometry ...
Post by: rrr314159 on April 04, 2015, 03:19:21 AM
Quote from: siekmanski
Funny that ftan is faster than fsin and fcos on your machine.
Or it must be that fsincos is that fast, because i calculated ftan with fsincos and fdivp.

Yes, I often get strange timings with this AMD - don't seem to make sense. I bet nidud would get same odd result. Modern / more powerful AMDs like sinsi's behave more like Intel

Given that these are all Intel, the FPU results look odd.

There are enough odd results here and in many other threads, no matter how you do the timing or what machine you use, to make one take all these figures with a grain of salt
Title: Re: Trigonometry ...
Post by: TWell on April 04, 2015, 04:39:11 AM
Code: [Select]
AMD E1-6010 1,35 GHz
SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

277  cycles SSE2sin Quadrant 1
320  cycles SSE2sin Quadrant 4

205  cycles FPUsin Quadrant 1
281  cycles FPUsin Quadrant 4

233  cycles FPUsin from SSE to FPU to SSE Quadrant 1
330  cycles FPUsin from SSE to FPU to SSE Quadrant 4

274  cycles SSE2cos Quadrant 1
329  cycles SSE2cos Quadrant 4

205  cycles FPUcos Quadrant 1
297  cycles FPUcos Quadrant 4

242  cycles FPUcos from SSE to FPU to SSE Quadrant 1
340  cycles FPUcos from SSE to FPU to SSE Quadrant 4

276  cycles SSE2tan Quadrant 1
280  cycles SSE2tan Quadrant 2

204  cycles FPUtan Quadrant 1
276  cycles FPUtan Quadrant 2

235  cycles FPUtan from SSE to FPU to SSE Quadrant 1
302  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

1080 cycles fsin (FPU)
1117 cycles fcos (FPU)
701 cycles ftan (FPU)

Press any key to continue...
Title: Re: Trigonometry ...
Post by: dedndave on April 04, 2015, 10:20:15 AM
ya know - i was thinking the FPU timings shown were awful slow
are you sure there isn't something amiss ?

similar for many FPU trig functions...

The angle must be expressed in radians and be within the -263 to +263 range.
If the source angle value is outside the acceptable range (but not INFINITY), the C2 flag of the Status Word
is set to 1 and the content of all data registers remains unchanged; the TOP register field of the Status Word
is not modified and no exception is detected.

if that occurs, or if the FPU stack is full (exception?), it would run very slow
Title: Re: Trigonometry ...
Post by: rrr314159 on April 04, 2015, 11:16:39 AM
Hi dedndave,

All siekmanski's timings are for a REPEAT 10 block. So, divide by 10 to get the times per one calculation. My timings (see previous pages) report one calculation, and I get somewhat more than 100 cycles for an FPU sin or cos. Since we're using these numbers only for comparison it doesn't matter as long as it's apples to apples. Many, I think most, laboratory threads follow the same practice; the numbers they're working so hard to minimize are for a REPEAT block of 10 or 100.

siekmanski or anyone else pls correct me if wrong.

No it's not overflow or any exception - believe me I know when that happens, have plenty of experience! Most common is when you don't balance FPU stack pushes and pops - when repeating 100,000,000 times, or trying to, the condition is unmistakable
Title: Re: Trigonometry ...
Post by: rrr314159 on April 04, 2015, 01:50:19 PM
Here is a trigSSE SIMD version which does two real8's in xmm0. It uses Default Rounding Mode (no more Floor Truncation), and MichaelW's timing macros. I'm working on a real4 version to do 4 at once, and also YMM AVX capability. This version does Cos as well as Sin, but uses Sin Power Series, so Cos is almost a cycle slower. Need to do a Cos Power Series version to speed that up.

I also made a better FPU version, perhaps I'll post it later. Does only one at a time, of course, a little slower per calculation than this SSE, but if you only need one at a time it's faster.

@qWord, that Jean-Michel Muller algorithm was a good start but it's very slow. When reduced to similar precision as trigSSE, about 4 times slower. Can provide details if anyone's interested - bottom line, he knows his numbers, but don't go to him to learn how to program!

As far as I can tell trigSSE is almost down to 1 nanosecond per calculation on my i5: about 4 cycles. I believe these numbers are pretty accurate of course it's hard to be sure. Here are timings:

Code: [Select]
Intel i5 3330
    ----------------------------------------
USING Default Round-to-Nearest
BLANK LOOP total is     162

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    472

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    520

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    365

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    414

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

===================================================================
AMD A6
    ----------------------------------------
USING Default Round-to-Nearest
BLANK LOOP total is     162

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    933

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    994

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    725

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    824

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

It does two precisions, about 10^-7 and 10^-5; think I can make those a bit better.

Here is the basic algo, see the zip for supporting constants and other details.

Code: [Select]
;;*******************************
trigSSE MACRO sincosflag:=<0>       ;; SSE SIMD sincosflag = 0 for sin
;;*******************************
    IF sincosflag
        addpd xmm0, op piover2      ;; converts to cos when sincosflag = 1
    ENDIF

    mulpd xmm0, op oneoverpi        ;; intro, convert to -.5 to .499999
    cvtpd2dq xmm1, xmm0
    cvtdq2pd xmm2, xmm1
    subpd xmm0, xmm2
    pslld xmm1, 31
    pshufd xmm1, xmm1, 073h         ;; convert xmm1 int's to neg masks
    xorpd xmm0, xmm1                ;; chs for left hemisphere
    mulpd xmm0, op pi
   
    movapd xmm2,xmm0
    mulpd xmm2,xmm2
    IF MORE_PRECISION
        movapd xmm1, op cheby_3
        mulpd xmm1, xmm2
        addpd xmm1, op cheby_2
        mulpd xmm1, xmm2
        addpd xmm1, op cheby_1
        mulpd xmm1, xmm2
        addpd xmm1, op cheby_0
        mulpd xmm0, xmm1
    ELSE
        movapd xmm1, op ch2
        mulpd xmm1, xmm2
        addpd xmm1, op ch1
        mulpd xmm1, xmm2
        addpd xmm1, op ch0
        mulpd xmm0, xmm1
    ENDIF
ENDM

Zip contains:

trig_macros.asm: the trigSSE macro and FPU macro
trig.asm: controller, simple to use I hope
test_support_macros: the hard part.
trig.exe
answers.out: 400 calc'ed vals with deltas
dotrig.bat: "MAKEFILE" batch file
Title: Re: Trigonometry ...
Post by: rrr314159 on April 04, 2015, 03:27:39 PM
@TWell, thanks for the contribution, these AMD's are odd, how can fsincos / fdivp be faster than fsin, which itself is almost 10 X faster than Intel? If u ever figure it out let us know
Title: Re: Trigonometry ...
Post by: TWell on April 04, 2015, 04:43:30 PM
Code: [Select]
AMD E1-6010 1,35 GHz
    ----------------------------------------
USING Default Round-to-Nearest
BLANK LOOP total is 168

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs 1221

Precision (HIGHER)
average precision 3.77e-007
worst precision    5.97e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs 1279

Precision (HIGHER)
average precision 3.77e-007
worst precision    5.97e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs 928

Precision (FASTER)
average precision 4.33e-005
worst precision    6.80e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs 1070

Precision (FASTER)
average precision 4.33e-005
worst precision    6.80e-005
Title: Re: Trigonometry ...
Post by: Siekmanski on April 04, 2015, 09:44:38 PM
Steve and Tim, thanks for the input.
Dave, timings are for a REPEAT 10 block. Divide by 10 to get the times per one calculation. ( forgot to mention that. )

rrr314159, Here are the timings:

Code: [Select]
Intel i7-4930K @3.4GHz
    ----------------------------------------
USING Default Round-to-Nearest
BLANK LOOP total is     179

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    577

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    657

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    496

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    518

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005
Title: Re: Trigonometry ...
Post by: Gunther on April 04, 2015, 10:43:48 PM
Hi rrr,

here are the timings:
Code: [Select]
USING Default Round-to-Nearest
BLANK LOOP total is     142

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    454

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    501

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    354

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    400

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

Gunther
Title: Re: Trigonometry ...
Post by: rrr314159 on April 05, 2015, 05:33:23 PM
Here is the "final" SSE SIMD Sin/Cos algo, which does double (two at once) or single (four at once), selected by the flag "_USEDOUBLE" in trig.asm. There's only one algo, written with macros which switch from double to single according to the flag. For instance the statement "rmul xmm0, xmm1" can mean either mulpd or mulps. The macros are defined in "SSE48.inc":

Code: [Select]
IF _USEDOUBLE
    WORD_SIZE = 8
ELSE
    WORD_SIZE = 4
ENDIF
WORD_SIZE$ textequ %WORD_SIZE
CHUNK = 16/WORD_SIZE                ; (SSE, no AVX) CHUNK = 2 double, 4 single
%rtype textequ <real>,<&WORD_SIZE$> ; real8 or real4
%rptr textequ <&rtype>,< ptr>     ; use rptr only for vals which change
%tempreal textequ <temp>,<&WORD_SIZE$>
%absmask textequ <absmask>,<&WORD_SIZE$>

scratchxmm equ xmm3

;===============================
IF _USEDOUBLE    ;; for real8 xmm double prec
;===============================
    rmov equ movapd
    radd equ addpd
    rsub equ subpd
    rmul equ mulpd
    rand equ andpd
    rmax equ maxpd
    rhmax MACRO thexmm
pshufd scratchxmm, thexmm, 8dh
rmax thexmm, scratchxmm
    ENDM
    rhadd equ haddpd
    rmovs8 equ movsd ;; scalar move into real8
    rcvtr2i equ cvtpd2dq
    rcvti2r equ cvtdq2pd
    rxor equ xorpd
    rcvt2negmask MACRO thexmm ;; convert xmm1 int's to neg masks
    pslld thexmm, 31
    pshufd thexmm, thexmm, 073h       
    ENDM
;===============================
ELSE ;; for real4 xmm single prec
;===============================
    rmov equ movaps
    radd equ addps
    rsub equ subps
    rmul equ mulps
    rand equ andps
    rmax equ maxps
    rhmax MACRO thexmm
movhlps scratchxmm, thexmm
maxps thexmm, scratchxmm
pshufd scratchxmm, thexmm, 55h
maxps thexmm, scratchxmm
    ENDM
    rhadd MACRO thexmm, thexmm
haddps thexmm, thexmm
haddps thexmm, thexmm
    ENDM
    rmovs8 MACRO thereal, thexmm ;; scalar move low float into real8
movss temp4, thexmm
fld temp4 ;; must go to this trouble apparently
fstp thereal
    ENDM
    rcvtr2i equ cvtps2dq
    rcvti2r equ cvtdq2ps
    rxor equ xorps
    rcvt2negmask MACRO thexmm ;; convert xmm1 int's to neg masks
    pslld thexmm, 31
    ENDM
;===============================
ENDIF
;===============================

Obviously this technique can be applied to any routine, not just sin/cos, and also can be extended to include AVX.

The timing with single precision is under a nanosecond per calc. I actually got one run with less than 1 cycle per sin (!): the FASTER precision (about 5 digits) did 100 in 99 cycles on the Intel:

Code: [Select]
Intel i5 3330 2.94 Ghz

Sleeping to stabilize timings...-----------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is     119

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    163

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    190

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    99

Precision (FASTER)
average precision       4.38e-005
worst precision         6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    121

Precision (FASTER)
average precision       4.38e-005
worst precision         6.76e-005

===================================================================
AMD A6 1.8 Ghz

Sleeping to stabilize timings...-----------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is     84

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    316

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    380

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    293

Precision (FASTER)
average precision       4.38e-005
worst precision         6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    336

Precision (FASTER)
average precision       4.38e-005
worst precision         6.76e-005

F:\SSE single to post>

... but what exactly this means is not clear since it depends on the blank loop timing so much; usually it's more like 136 cycles or so; and AMD is close to 300. Anyway, it's pretty fast.

To-do list includes:

- sharpen the coefficients
- AVX version
- Cos Power Series

This basically accomplishes what I wanted from this thread, and I've had enough of sinus problems for the time being, so the to-do list will be put off indefinitely. Thanks to all, especially siekmanski for the LUT, qWord for the Muller algo, and jj2007 for the Chebyshev suggestion. Of course any further comments, timings, bug reports, kudos or complaints are welcome. Well, maybe not complaints.

The zip contains

trig_macros.asm: the SSE single/double routine
trig.asm: controller
test_support_macros.asm: the hard part
SSE48.inc: includes single/double macros
trig.exe
dotrig.bat: "MakeFile" batch file

Title: Re: Trigonometry ...
Post by: TWell on April 05, 2015, 06:00:07 PM
Code: [Select]
AMD E1-6010 1,35 GHz
Sleeping to stabilize timings...------------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is 80

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs 427

Precision (HIGHER)
average precision 3.81e-007
worst precision    8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs 465

Precision (HIGHER)
average precision 4.01e-007
worst precision    8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs 369

Precision (FASTER)
average precision 4.38e-005
worst precision    6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs 429

Precision (FASTER)
average precision 4.38e-005
worst precision    6.76e-005
Title: Re: Trigonometry ...
Post by: nidud on April 05, 2015, 08:15:31 PM
AMD Athlon(tm) II X2 245 Processor (SSE3)
Code: [Select]
Sleeping to stabilize timings...------------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is 111

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs 340

Precision (HIGHER)
average precision 3.81e-007
worst precision    8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs 425

Precision (HIGHER)
average precision 4.01e-007
worst precision    8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs 277

Precision (FASTER)
average precision 4.38e-005
worst precision    6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs 304

Precision (FASTER)
average precision 4.38e-005
worst precision    6.76e-005
Title: Re: Trigonometry ...
Post by: Siekmanski on April 05, 2015, 08:54:39 PM
Intel i7-4930K @3.4GHz

Code: [Select]
Sleeping to stabilize timings...------------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is     140

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    190

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    216

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    117

Precision (FASTER)
average precision       4.38e-005
worst precision         6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    201

Precision (FASTER)
average precision       4.38e-005
worst precision         6.76e-005
Title: Re: Trigonometry ...
Post by: dedndave on April 05, 2015, 10:00:13 PM
P4 prescott w/htt
Code: [Select]
Sleeping to stabilize timings...------------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is     135

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    902

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    977

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    737

Precision (FASTER)
average precision       4.38e-005
worst precision         6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    839

Precision (FASTER)
average precision       4.38e-005
worst precision         6.76e-005
Title: Re: Trigonometry ...
Post by: rrr314159 on April 06, 2015, 03:02:05 AM
thx Gunther, siekmanski, TWell, nidud, dedndave,

it's encouraging to see that the Blank Loop totals are all close, indicating stability. They dropped between the 2 versions of the program as I cleaned it up and eliminated some no-longer used portions. Blank Loop is mostly non-SSE instructions. This indicates to me that the calculation numbers pretty accurately reflect how the machines handle SSE instructions; obviously AMD, and older, machines are not as good (altho AMD is still champ for fsin,  fcos and "ftan"). BTW my Intel numbers beat siekmanski's, altho his machine is actually better; that's only because I selected the best run out of 20 or 30, a real outlier. My normal runs were behind his by a factor approx. 136 to 117. Note this is independent of clock speed; but normally the higher the clock speed, the fewer the cycles, because both are characteristic of more modern CPU's. Be interesting to do a statistical analysis ... but it's a beautiful day, Spring has sprung, and I'm outta here!

Thanks again,
Title: Re: Trigonometry ...
Post by: dedndave on April 06, 2015, 04:20:04 AM
i noticed that the test loops execute VERY quickly
if you want nice stable timings, increase the loop counts so that each test takes at least 0.5 seconds   :t
the loop count does not need to be the same for all the tests
i.e., faster code => larger loop count
Title: Re: Trigonometry ...
Post by: Siekmanski on April 06, 2015, 06:14:49 AM
I like your macros, nice style.  :t

BTW my Intel numbers beat siekmanski's, altho his machine is actually better; that's only because I selected the best run out of 20 or 30, a real outlier. My normal runs were behind his by a factor approx. 136 to 117.

Don't get fooled by my machine. I tuned it down to lower the fan-speeds so, it becomes a very quiet but still fast machine.
This is because i use it sometimes for sound recordings.  :biggrin:

EDIT:
Quote
I also did find some speed improvements in my routines ( put in some linear algebra )

Thought i replace,
        mov   edx,16384*2
        sub    edx,eax
with: xor    eax,-1
And gain some speed, was i wrong here.

 xor    eax,-1        0.86 cycles

 mov   edx,16384*2
 sub    edx,eax     together 0.53 cycles

Wasted 1 hour programming....  :(

Added a sincos routine. When i'm done testing i'll post them.
BTW i'm still curious for your faster fpu routine you mentioned a few posts before.

Title: Re: Trigonometry ...
Post by: rrr314159 on April 06, 2015, 12:28:06 PM
I like your macros, nice style.

- thanks ... BTW nidud mentioned he preferred proc's. When I want a proc for small math algos like this (for instance making a library) I write the macro and create proc's from it e.g.

sin_high_precision_single proc   ; assuming inputs are in xmm0 (or u could pass them to proc)
    _USEDOUBLE = 0
    MORE_PRECISION = 1
    trigSSE 0
ret
sin_high_precision_single endp

cos_low_precision_double proc   ; assuming inputs are in xmm0
   _USEDOUBLE = 1
   MORE_PRECISION = 0
   trigSSE 1
ret
cos_low_precision_double endp

...etc You can also pass flags to the proc (highprecisionflag:BOOL etc) and use multiple invocations within it. This way one macro is used to make all the procs you need. For maintenance just one macro needs to be fixed / changed, and recompile the proc's. Of course there are many cases where such a technique is meaningless, and I just write proc in the normal way.

What do u think of such an approach - in cases where applicable, which often happens with these small math algos - is this commonly done?

...i use it sometimes for sound recordings.  :biggrin:

- Not that it's any of my business but if u feel like mentioning, what do u record? I'm pseudo-professional bass player myself

Thought i replace,
        mov   edx,16384*2
        sub    edx,eax
with: xor    eax,-1
And gain some speed, was i wrong here.

 xor    eax,-1        0.86 cycles

 mov   edx,16384*2
 sub    edx,eax     together 0.53 cycles

- Actually "neg eax" is correct, not xor, altho depending how u use the LUT (0..16383 or 1..16384) and rounding mode xor might give the right result anyway. But I wonder if neg is faster?

Wasted 1 hour programming....  :(

- NOT wasted (IMHO). Had this argument often working in R&D - we'd spend 6 months on some possible approach, wind up proving it was no good, drove budget people crazy. They figured we could have just done nothing, same result, saved half a million bucks! But then we wouldn't know the approach didn't work. That's R&D for you.


BTW i'm still curious for your faster fpu routine you mentioned a few posts before.

- Sure - The routine is more-or-less done, but I couldn't face updating test_support_macros to try it out! That's the hard part. If u don't mind testing it yourself, here it is. From your POV, the best part is it uses normal default rounding mode. Could be a made a little faster I guess ... let me know if it works ;)

Code: [Select]
;;*******************************
;; rrr314159 2015/03/30 fpu-based Power Series for Sin and Cos
;;******************************* use default round-to-nearest
MORE_PRECISION = 1              ;; set to 0 for faster
;;*******************************
trigFPU MACRO sincosflag:=<0>   ;; input and output in st
;;*******************************
    IFNDEF cheby_0
        .const                 
            align 16
            pi real8 3.1415926535897931 ;; generic constants
            oneoverpi real8 0.31830988618379067
            piovertwo real8 1.5707963267948966
           
            cheby_0 real8 0.99999660  ;; 4 coeffs approx 10^-7
            cheby_1 real8 -0.16664824
            cheby_2 real8 0.00830629
            cheby_3 real8 -0.00018363
       
            ch0 real8 0.999695        ;; 3 coeffs approx 10^-5
            ch1 real8 -0.1656700
            ch2 real8 0.0075134
        .code
    ENDIF
   
    IF sincosflag
        fadd piovertwo
    ENDIF
   
    fmul oneoverpi              ;; div by pi to put in two hemispheres
    fld st
    fistp qword ptr [esp-8]
    mov eax, dword ptr [esp-8]  ;; eax has (lower word of) rounded quotient x / pi
    fild qword ptr [esp-8]
    fsub                        ;; now mod 1 (-.5 to .4999999) meaning -pi/2 to pi/2
    test eax, 1
    jz @F
        fchs                    ;; -x for left hemisphere
    @@:
    fmul pi                     ;; restore scale
    fld st
    fmul st, st                 ;; x^2, x
   
    IF MORE_PRECISION
        firstone cheby_3        ;; cheby_3*x^2, x^2, x
        addone cheby_2          ;; and so on
        addone cheby_1
        fxch
        fstp st
        addone cheby_0
    ELSE
        firstone ch2
        addone ch1
        fxch
        fstp st
        addone ch0
    ENDIF
    fxch
    fstp st                     ;; answer in st(0) all other regs free
ENDM
;--------------------------------
firstone MACRO __coeff:REQ     ;; utility MACRO for trigFPU
    fld __coeff
    fmul st, st(1)
ENDM
;--------------------------------
addone MACRO __coeff:REQ       ;; utility MACRO for trigFPU
    fadd __coeff
    fmul st, st(1)
ENDM
;;*******************************
Title: Re: Trigonometry ...
Post by: Siekmanski on April 06, 2015, 07:36:07 PM
What do u think of such an approach - in cases where applicable, which often happens with these small math algos - is this commonly done?

I think so, why not.
My intention is to use the trig functions for my D3Dmath.lib so, i'll create them as proc's.
Rewrite them as macros when i need some extra speed.

- Not that it's any of my business but if u feel like mentioning, what do u record? I'm pseudo-professional bass player myself

It's for realtime voice recording for an FM-broadcast proggy i'm writing. (and i just hate noisy computers)

- Actually "neg eax" is correct, not xor, altho depending how u use the LUT (0..16383 or 1..16384) and rounding mode xor might give the right result anyway. But I wonder if neg is faster?

Code: [Select]
A)  mov     edx,16384*2
    sub     edx,eax
    fld     real4 ptr [SSE2SinTableSP+edx*4-4] 

B)  xor     eax,-1
    fld     real4 ptr [SSE2SinTableSP+(16384*4*2)+eax*4] 

C)  neg     eax
    fld     real4 ptr [SSE2SinTableSP+(16384*4*2)+eax*4-4]

B is slower than A. C seems the fastest on my machine.
Have to test it and pick the fasted variant for each quadrant.

Thanks for the routine, i'll test it next week, no time for programming this week.
Title: Re: Trigonometry ...
Post by: dedndave on April 07, 2015, 11:44:25 PM
Code: [Select]
not eax
inc eax
is equiv to neg eax

xor eax,-1 is the same as not eax   :biggrin:

i would think neg eax is fastest
Title: Re: Trigonometry ...
Post by: Gunther on April 08, 2015, 01:29:38 AM
Dave,

Code: [Select]
not eax
inc eax
is equiv to neg eax

xor eax,-1 is the same as not eax   :biggrin:

That's true.

i would think neg eax is fastest

Assume nothing. We should meassure it.

Gunther
Title: Re: Trigonometry ...
Post by: nidud on April 08, 2015, 02:22:26 AM
Code: [Select]
     AMD   Intel
INC     reg            [1]   2.2   2.5
INC     r16            [2]   2.2   2.5
INC     r08            [2]   2.2   2.5
INC     mem            [6]  15.0  16.9
NEG     reg            [2]   2.2   2.5
NEG     mem            [6]  15.0  16.9
NOT     reg            [2]   2.2   2.5
NOT     mem            [6]  14.9  16.9
XOR     reg,reg        [2]   0.9   1.1
XOR     mem,reg        [6]  14.9  16.9
XOR     reg,mem        [6]   2.4   2.5
XOR     reg,imm        [3]   2.2   2.5
XOR     mem,imm        [7]  14.9  16.6
XOR     acc,imm        [3]   2.2   2.5

I will assume the size of the instruction plays a bigger role in this case, where jump distance and alignment may change. Adding instructions even if this in theory will add cycles may still produce a faster algo.

The problem with this is that it’s difficult to control the alignment in a macro where it may behave differently according to where it’s used. That’s why I prefer proc's when testing. However, if you plan on using it as a macro this may not be the best method.
Title: Re: Trigonometry ...
Post by: rrr314159 on April 08, 2015, 03:00:18 AM
@dedndave - if u remember we had approx. the same issue 3 weeks ago in this thread (http://masm32.com/board/index.php?topic=4083.0). We saw that u could substitute neg x for "dec / and" (same as "not / inc"), and mabdelouahab determined it was faster - which is very much what one would suppose.

In the present case it's used in a large LUT so the difference of "1" can be compensated in other ways; still neg is about the same speed as "not", and accurate, so why not use it.

Quote from: Gunther
Assume nothing. We should meassure it.

- mabdelouahab did measure it in that thread (almost the same instructions). While I agree with you, of course, remember there are problems with measuring - different machines and situations lead to very different results.

Quote from: nidud
I will assume the size of the instruction plays a bigger role in this case...

- read above Gunther quote! ;) U should measure whether the size of the instruction plays a bigger role - and if so, just how big? Still, at some point it gets ridiculous; one spends one's whole life measuring, never gets around to doing ...

Quote from: nidud
The problem with this is that it’s difficult to control the alignment in a macro where it may behave differently according to where it’s used...

- I just start the macro with an alignment instruction, when appropriate. Admittedly it may behave differently anyway depending how it's used. What I do, make sure resulting program performs as expected; seems it always does.

Bottom line, I think measuring at the unit level can be overdone (not testing - that can never be overdone); at some point, just do system integration and make sure the system works right. You have to do that ultimately anyway.
Title: Re: Trigonometry ...
Post by: nidud on April 08, 2015, 04:31:57 AM
U should measure whether the size of the instruction plays a bigger role - and if so, just how big?

I think that will depend more on the actual offset of the instruction, and if this is unknown (macro) that will be rather difficult.

Quote
Still, at some point it gets ridiculous; one spends one's whole life measuring, never gets around to doing ...

That’s definitely true.

On the other hand a bit irritating using days fine-tuning based on wrong assumptions, so the more accurate the test the better.
Title: Re: Trigonometry ...
Post by: Gunther on April 08, 2015, 09:15:04 AM
Hi nidud,

That’s definitely true.

On the other hand a bit irritating using days fine-tuning based on wrong assumptions, so the more accurate the test the better.

sure. But it's a short instruction sequenz to test, so RDTSC or perhaps RDTSCP (http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf) should do a good job.

Gunther
Title: Re: Trigonometry ...
Post by: Farabi on April 10, 2015, 12:47:56 PM
Whoa, Thanks, Very cool :t jack of all trade.
Title: Re: Trigonometry ...
Post by: rrr314159 on April 13, 2015, 02:39:47 PM
Here is a version of the trig Sin / Cos algo for AVX. It's written in what-I-call "Orthogonal" instruction set, so that one algo works for AVX or SSE, double or single precision. Each instruction is a macro which implements one of the four possibilities depending on flags.

The AVX implementation is 30 or 40 percent faster, so I guess it's worth it. Still it's so much trouble it's not clear. Without AVX2 u have to shuffle between high and low lanes a lot, using such commands as vextractf128, vinsertf128, vpunpckhdq, vpunpckldq, vshufd, and vperm2f128. You often have to perform ops on high and low lanes separately, temp-store in memory, other annoying techniques.

Please let me know if there are better ways to do macros like rhadd, rhmax, rmovs8, and especially rcvt2negmask.

Is AVX worth this trouble? Since AVX2 has only been around a couple of years it will not be in general use for a long time, so I can't count on its better integer instructions. AVX seems to have similar status to 64-bit - sounds good but one can do without it. How many of you simply don't bother with AVX?

The program trig.exe runs only higher-precision sin and cos, for all 4 Instruction Sets. I'm not bothering to show timings but the zip includes them. In the zip are

trig.asm - see control values at top for different ways to run it
test_support_macros.asm - uses Orthogonal instructions
trig_macros.asm - the actual trig algo very simple compared to supporting routines
localmacs.asm - mainly for printing SSE/AVX registers pretty messy
SSEAVX48.asm - the "Orthogonal" macros
trig.exe
dotrig.bat - "Make" batch file
trigtimings AVXSSE single double.out - timings results

Here are these macros for AVX/SSE single/double.

Code: [Select]
;; need data statements because 7fffffffffffffffh is too long a constant
;; and for temp data storage
IFNDEF absmask8
.const
    absmask8 dq 4 dup (7fffffffffffffffh)   ; NOT the sign bit, for double
    absmask4 dd 8 dup (7fffffffh)           ; NOT the sign bit, for single
    negmask8 dq 4 dup (8000000000000000h)   ; the sign bit, for double
    negmask4 dd 8 dup (80000000h)           ; the sign bit, for single
.data?
    align 16
    temp8 real8 4 dup(?)
    temp4 real4 8 dup(?)
.code
ENDIF

_DOUBLE equ 1
_SINGLE equ 0
_AVX equ 1
_SSE equ 0
ISTypeXY textequ <X>
WORD_SIZE = 4
ISType textequ <X4>

scratchrmm equ rmm3
scratchxmm equ xmm3

;;*******************************
;;Define all the equates used in an "Orthogonal" algo
;;*******************************
ChangeInstructionSet MACRO _USEDOUBLE:=<0>, _USEAVX:=<0>
;;*******************************
    IF _USEAVX
        rmm0 TEXTEQU <ymm0>
        rmm1 TEXTEQU <ymm1>
        rmm2 TEXTEQU <ymm2>
        rmm3 TEXTEQU <ymm3>
        rmm4 TEXTEQU <ymm4>
        rmm5 TEXTEQU <ymm5>
        rmm6 TEXTEQU <ymm6>
        rmm7 TEXTEQU <ymm7>
    ELSE
        rmm0 TEXTEQU <xmm0>
        rmm1 TEXTEQU <xmm1>
        rmm2 TEXTEQU <xmm2>
        rmm3 TEXTEQU <xmm3>
        rmm4 TEXTEQU <xmm4>
        rmm5 TEXTEQU <xmm5>
        rmm6 TEXTEQU <xmm6>
        rmm7 TEXTEQU <xmm7>
    ENDIF
   
    ;; "O" generally refers to xmm or ymm register
   
    IF _USEAVX
        O_SIZE = 32
        rmmword equ <ymmword>
        op equ ymmword ptr
        ISTypeXY textequ <Y>         ;; instruction set type
    ELSE
        O_SIZE = 16
        rmmword textequ <xmmword>
        op equ xmmword ptr
        ISTypeXY textequ <X>
    ENDIF
   
    WORD_SIZE = 4 * (_USEDOUBLE + 1)

    CHUNK = O_SIZE/WORD_SIZE          ;; (SSE) CHUNK = 2 double, 4 single
    WORD_SIZE$ textequ %WORD_SIZE       ;; AVX = 4 and 8
    ISType textequ ISTypeXY,WORD_SIZE$    ;; ISType is X4, X8, Y4, Y8
    rtype textequ <real>,WORD_SIZE$
    rptr textequ <real>,WORD_SIZE$,< ptr>
    tempreal textequ <temp>,WORD_SIZE$
    absmask textequ <absmask>,WORD_SIZE$
    negmask textequ <negmask>,WORD_SIZE$

ENDM
;;*******************************
;================================
rmov MACRO x1, x2
    IFIDN ISType, <X4>
        movups x1, x2
    ELSEIFIDN ISType, <X8>
        movupd x1, x2
    ELSEIFIDN ISType, <Y4>
        vmovups x1, x2
    ELSE    ;; Y8
        vmovupd x1, x2
    ENDIF
ENDM
;================================
radd MACRO x1, x2
    IFIDN ISType, <X4>
        addps x1, x2
    ELSEIFIDN ISType, <X8>
        addpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vaddps x1, x1, x2
    ELSE    ;; Y8
        vaddpd x1, x1, x2
    ENDIF
ENDM
;================================
rsub MACRO x1, x2
    IFIDN ISType, <X4>
        subps x1, x2
    ELSEIFIDN ISType, <X8>
        subpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vsubps x1, x1, x2
    ELSE    ;; Y8
        vsubpd x1, x1, x2
    ENDIF
ENDM
;================================
rmul MACRO x1, x2
    IFIDN ISType, <X4>
        mulps x1, x2
    ELSEIFIDN ISType, <X8>
        mulpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vmulps x1, x1, x2
    ELSE    ;; Y8
        vmulpd x1, x1, x2
    ENDIF
ENDM
;================================
rsqrsrcintodest MACRO x1, x2
    IFIDN ISType, <X4>
        movups x1, x2
        mulps x1, x1
    ELSEIFIDN ISType, <X8>
        movupd x1, x2
        mulpd x1, x1
    ELSEIFIDN ISType, <Y4>
        vmulps x1, x2, x2
    ELSE
        vmulpd x1, x2, x2
    ENDIF
ENDM
;================================
rmax MACRO x1, x2
    IFIDN ISType, <X4>
        maxps x1, x2
    ELSEIFIDN ISType, <X8>
        maxpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vmaxps x1, x1, x2
    ELSE    ;; Y8
        vmaxpd x1, x1, x2
    ENDIF
ENDM
;================================
rand MACRO x1, x2
    IFIDN ISType, <X4>
        andps x1, x2
    ELSEIFIDN ISType, <X8>
        andpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vandps x1, x1, x2
    ELSE    ;; Y8
        vandpd x1, x1, x2
    ENDIF
ENDM
;================================
rxor MACRO x1, x2
    IFIDN ISType, <X4>
        xorps x1, x2
    ELSEIFIDN ISType, <X8>
        xorpd x1, x2
    ELSEIFIDN ISType, <Y4>
        vxorps x1, x1, x2
    ELSE    ;; Y8
        vxorpd x1, x1, x2
    ENDIF
ENDM
;================================
rhmax MACRO x1
    IFIDN ISType, <X4>
movhlps scratchrmm, x1
maxps x1, scratchrmm
pshufd scratchrmm, x1, 55h
maxps x1, scratchrmm
    ELSEIFIDN ISType, <X8>
pshufd scratchrmm, x1, 8dh
maxpd x1, scratchrmm
    ELSEIFIDN ISType, <Y4>
        vperm2f128 scratchrmm, x1, x1, 23h
        vmaxps x1, x1, scratchrmm
        vshufps scratchrmm, x1, x1, 04eh
        vmaxps x1, x1, scratchrmm
        vshufps scratchrmm, x1, x1, 0bbh
        vmaxps x1, x1, scratchrmm
    ELSE
        vperm2f128 scratchrmm, x1, x1, 23h
        vmaxpd x1, x1, scratchrmm
        vshufpd scratchrmm, x1, x1, 005h
        vmaxpd x1, x1, scratchrmm
    ENDIF
ENDM
;================================
rhadd MACRO x1
    IFIDN ISType, <X4>
haddps x1, x1
haddps x1, x1
    ELSEIFIDN ISType, <X8>
haddpd x1, x1
    ELSEIFIDN ISType, <Y4>
        vperm2f128 scratchrmm, x1, x1, 23h
        vaddps x1, x1, scratchrmm     ; can also be vhaddps
        vhaddps x1, x1, scratchrmm
        vhaddps x1, x1, scratchrmm
    ELSE
        vperm2f128 scratchrmm, x1, x1, 23h
        vaddpd x1, x1, scratchrmm
        vhaddpd x1, x1, scratchrmm
    ENDIF
ENDM
;================================
rmovs8 MACRO thereal, x1 ;; scalar move low float into real8
    IFIDN ISType, <X4>
movss temp4, x1
fld temp4 ;; must go to this trouble apparently
fstp thereal
    ELSEIFIDN ISType, <X8>
movsd thereal, x1
    ELSEIFIDN ISType, <Y4>
% vmovss temp4, &XMMfromYMM(x1)
fld temp4 ;; must go to this trouble apparently
fstp thereal
    ELSE
% vmovsd thereal, &XMMfromYMM(x1)
    ENDIF
ENDM
;================================
rcvtr2i MACRO x1, x2
    IFIDN ISType, <X4>
        cvtps2dq x1, x2
    ELSEIFIDN ISType, <X8>
cvtpd2dq x1, x2
    ELSEIFIDN ISType, <Y4>
        vcvtps2dq x1, x2
    ELSE
% vcvtpd2dq &XMMfromYMM(x1), x2
    ENDIF
ENDM
;================================
rcvti2r MACRO x1, x2
    IFIDN ISType, <X4>
        cvtdq2ps x1, x2
    ELSEIFIDN ISType, <X8>
cvtdq2pd x1, x2
    ELSEIFIDN ISType, <Y4>
        vcvtdq2ps x1, x2
    ELSE
vcvtdq2pd x1, x2           ;; note using only low 128 bits of x2
    ENDIF
ENDM
;================================
rcvt2negmask MACRO x1              ;; convert xmm1 int's to neg masks
LOCAL thexmm
    IFIDN ISType, <X4>
    pslld x1, 31
    ELSEIFIDN ISType, <X8>
    pslld x1, 31
    pshufd x1, x1, 073h       
    ELSEIFIDN ISType, <Y4>
        vextractf128 scratchxmm, x1, 1
%        thexmm equ &XMMfromYMM(x1)
        vpslld scratchxmm, scratchxmm, 31
        vpslld thexmm, thexmm, 31
        vinsertf128 x1, x1, scratchxmm, 1
    ELSE                   ;; NOTE using xmm2 as scratch also = cheating
        vxorps scratchxmm, scratchxmm, scratchxmm
%        thexmm equ &XMMfromYMM(x1)
        vpslld thexmm, thexmm, 31
        vpunpckhdq xmm2, scratchxmm, thexmm
        vpunpckldq thexmm, scratchxmm, thexmm
        vinsertf128 x1, x1, xmm2, 1
    ENDIF
ENDM
;================================
rpxor MACRO x1, x2              ;; only being used to zero
    IFIDN ISTypeXY, <X>
        pxor x1, x2
    ELSE
        vxorps x1, x1, x2        ;; vpxor is only AVX2
    ENDIF
ENDM
;================================

What do you think, is AVX worth the trouble? If so, is this "orthogonal instruction set" idea worth it?
Title: Re: Trigonometry ...
Post by: Gunther on April 13, 2015, 06:02:56 PM
Hi rrr,

What do you think, is AVX worth the trouble? If so, is this "orthogonal instruction set" idea worth it?

my machine at the university doesn't support AVX. So I can test your program this evening at home. I think that AVX is worth the effort. AVX512 is coming soon, too.

Gunther
Title: Re: Trigonometry ...
Post by: Siekmanski on April 15, 2015, 03:57:30 AM
Hi, i'm back from a week without programming.
Did improve the speed of the LUT approach.
Although the LUT is a little faster for calculating one value, i must say i like the Chebyshev approximation a lot more.
More precision and easy to use with SIMD instructions.
So, now i'll try to make some more trig functions for my D3Dmath.lib

Hi rrr, here is another one with much higher precision, a 9th degree polynomial and only 5 coeffs ( maximum error of about 3.3381e-9 )

Code: [Select]
.data
align 16
OneDivPiDP      real8 2 dup (0.31830988618379067153776752674)
PiDP            real8 2 dup (3.14159265358979323846264338327)

ChebyCh0        real8 2 dup (0.99999997658988206732793421604)
ChebyCh1        real8 2 dup (-0.16666647634639712527586027070)
ChebyCh2        real8 2 dup (0.00833289982335175125347370686)
ChebyCh3        real8 2 dup (-0.00019800897762795431268299998)
ChebyCh4        real8 2 dup (0.00000259048850053605227412420)

.code

SSE2cheby9 proc

    mulpd       xmm0,oword ptr OneDivPiDP   ; 1/pi to get a 1pi range
    cvtpd2dq    xmm2,xmm0                   ; (2 packed dpfp to 2 even int32) lose the fractional parts and keep it in xmm2 to save the signs
    cvtdq2pd    xmm1,xmm2                   ; (2 even packed int32 to 2 dpfp) save the integral parts
    subpd       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 to 0.4999999
    pslld       xmm2,31
    pshufd      xmm2,xmm2,73h         
    xorpd       xmm0,xmm2                   ; set sign-bits
    mulpd       xmm0,oword ptr PiDP         ; restore ranges between -1/2 pi to +1/2 pi

; Now do the Chebyshev approximation of a 9th degree polynomial
; With only 5!! optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi

    movapd      xmm2,xmm0
    mulpd       xmm2,xmm2

    movapd      xmm1,oword ptr ChebyCh4
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyCh3
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyCh2
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyCh1
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyCh0
    mulpd       xmm0,xmm1
    ret
SSE2cheby9 endp
Title: Re: Trigonometry ...
Post by: Gunther on April 15, 2015, 04:23:07 AM
Hi rrr,

here are your AVX results. I hope that helps.
Code: [Select]
Sleeping to stabilize timings...------------------------------------
=========================================
REAL8 sin/cos AVX algorithm
=========================================
BLANK LOOP total is     113

Test Function: trig Sin ========================================

Cycles for 100 Calcs    231

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    252

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007
=========================================
REAL4 sin/cos AVX algorithm
=========================================
BLANK LOOP total is     19

Test Function: trig Sin ========================================

Cycles for 100 Calcs    145

Precision (HIGHER)
average precision       3.96e-007
worst precision         8.34e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    154

Precision (HIGHER)
average precision       4.39e-007
worst precision         9.09e-007
=========================================
REAL8 sin/cos SSE algorithm
=========================================
BLANK LOOP total is     118

Test Function: trig Sin ========================================

Cycles for 100 Calcs    460

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    541

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007
=========================================
REAL4 sin/cos SSE algorithm
=========================================
BLANK LOOP total is     57

Test Function: trig Sin ========================================

Cycles for 100 Calcs    205

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    219

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007

Gunther
Title: Re: Trigonometry ...
Post by: TWell on April 15, 2015, 04:38:14 AM
Code: [Select]
AMD E1-6010 1,35 GHz

Sleeping to stabilize timings...------------------------------------
=========================================
REAL8 sin/cos AVX algorithm
=========================================
BLANK LOOP total is 106

Test Function: trig Sin ========================================

Cycles for 100 Calcs 987

Precision (HIGHER)
average precision 3.83e-007
worst precision    6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs 977

Precision (HIGHER)
average precision 3.83e-007
worst precision    6.27e-007
=========================================
REAL4 sin/cos AVX algorithm
=========================================
BLANK LOOP total is 36

Test Function: trig Sin ========================================

Cycles for 100 Calcs 444

Precision (HIGHER)
average precision 3.96e-007
worst precision    8.34e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs 444

Precision (HIGHER)
average precision 4.39e-007
worst precision    9.09e-007
=========================================
REAL8 sin/cos SSE algorithm
=========================================
BLANK LOOP total is 213

Test Function: trig Sin ========================================

Cycles for 100 Calcs 1166

Precision (HIGHER)
average precision 3.83e-007
worst precision    6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs 1239

Precision (HIGHER)
average precision 3.83e-007
worst precision    6.27e-007
=========================================
REAL4 sin/cos SSE algorithm
=========================================
BLANK LOOP total is 62

Test Function: trig Sin ========================================

Cycles for 100 Calcs 468

Precision (HIGHER)
average precision 3.81e-007
worst precision    8.79e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs 506

Precision (HIGHER)
average precision 4.01e-007
worst precision    8.05e-007
Title: Re: Trigonometry ...
Post by: Siekmanski on April 15, 2015, 05:22:08 AM
Code: [Select]
Sleeping to stabilize timings...------------------------------------
=========================================
REAL8 sin/cos AVX algorithm
=========================================
BLANK LOOP total is     96

Test Function: trig Sin ========================================

Cycles for 100 Calcs    320

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    364

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007
=========================================
REAL4 sin/cos AVX algorithm
=========================================
BLANK LOOP total is     34

Test Function: trig Sin ========================================

Cycles for 100 Calcs    175

Precision (HIGHER)
average precision       3.96e-007
worst precision         8.34e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    175

Precision (HIGHER)
average precision       4.39e-007
worst precision         9.09e-007
=========================================
REAL8 sin/cos SSE algorithm
=========================================
BLANK LOOP total is     133

Test Function: trig Sin ========================================

Cycles for 100 Calcs    606

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    663

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007
=========================================
REAL4 sin/cos SSE algorithm
=========================================
BLANK LOOP total is     63

Test Function: trig Sin ========================================

Cycles for 100 Calcs    252

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    284

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007
Title: Re: Trigonometry ...
Post by: rrr314159 on April 15, 2015, 10:59:12 AM
Thanks Gunther, TWell, siekmanski, farabi,

Gunther I can see why you figure AVX is worth it you get great improvement over SSE, more than as twice as fast in some cases! ... but of course these timings can't be entirely trusted. farabi is right to approve your RDTSCP timing recommendation, also using DOS would help. Anyway these numbers are good enuff to give the basic idea, AVX is certainly faster.

When I ask "is it worth it?" however I'm not only thinking of the time improvement, or even the extra work figuring out AVX, I'm mainly worried about maintenance. AVX adds a bunch of complicated instructions; when bugs crop up, they will prob be found there. Also many computers can't use AVX so need to include the CPUID test. In future one must add AVX2 and AVX512 ... it's tempting to just leave it at SSE and figure "good enough" ... but really it's not.

That's why I made these macros, so I can use one algorithm for all 4 instruction sets. And they will easily expand to AVX512 etc in the future. I haven't seen anyone else doing this, maybe there's some gotcha I'm missing? I'm intending to start a new thread concentrating on this "Orthogonal Instruction Set" idea and hope someone has a better way to do it, or can tell me what's wrong with it. I'm making a new algo (Complex Rational Polynomials) to give these macros a good work-out.

TWell, as usual AMD behaves different, AVX improvement is less. My AMD is about the same. Someday I'd like to study AMD peculiarities. They're better than Intel in some ways, worse in others.

siekmanski (referring to your previous post) thanks for examining that algo it really helps to have someone take a close look at it. I looked on net for more Chebyshev coefficients but couldn't find them. The ones I have give 5 and 7 digits precision; as it happens, for my current application, 6 is right, so I was happy with these. But the 5-coeffs to give 9 digits is good to have. If you have more, please post them, I'd particularly like to see coefficients for Cosine? But higher levels would be useful (6, 7 coeffs) also more precise versions of the 3 and 4 coeffs I'm using. But oddly enough, more precise coeffs don't seem to help? If you chop your 5-coeff set down to only 7 significant digits I think it may give about the same precision, 3.3381e-9?

You seem surprised at how good the 5 coeffs are - and yes, they are - but that's characteristic of sin (and cos). Sin is an "odd" function, antisymmetric around 0, so a power series approximation uses only odd powers: x, x^3, x^5, x^7, x^9 etc. So each extra coeff improves the approximation by (on the order of, roughly) 2 digits. So 3 cheby coeffs gives 5 digits, 4 gives 7, and 5 coeffs is up to 9 digits precision. (Whereas if a function is not odd - or even, like Cosine - each extra coeff only adds one power of x, roughly 1 digit instead of 2.) To get all the way to 17 (like fsin) would take 9 coeffs. This is just a naive estimate ignoring various factors but probably more or less correct. It seems that using this technique we can beat fsin by a large margin at the same precision which makes a real mystery, why Intel fsin isn't a lot faster? Of course AMD is; presumably they're using this cheby power series or something similar. As far as I can see Intel screwed up, but there must be more to the story than that.

I looked into computing cheby coeff's but if you have a source for them pls post them save a bunch of math. (Which of course is easy if you're on top of it, but it's been 40 years since I was). BTW I think I may be able to improve on Chebyshev, for our specific purposes, quite a bit ... let you know if so.

As I say I'm glad you studied the algo but I think you've made a mistake? I'm using

Code: [Select]
    pslld xmm2, 31
    pshufd xmm2, xmm2, 73h         ;; convert xmm2 int's to neg masks

and u thought you could improve that by

Code: [Select]
psllq xmm2, 63   ; put sign-bits in position, to place values in the right hemispheres
on the face of it that seems good but no, it doesn't work? Because the two dwords are packed into the lower 64 bits. If they were in 0..31 and 63..95 it would work but they're not. One of us is missing something?

thanks all,






Title: Re: Trigonometry ...
Post by: Siekmanski on April 15, 2015, 04:36:41 PM
Hi rrr,

Quote
I looked on net for more Chebyshev coefficients but couldn't find them.

I found 2 interesting sites,

Chebyshev polynomial using the Clenshaw algorithm:
http://goddard.net.nz/files/js/chebyshev/cheby.html  ( nice, but not so fast )

tried it out,
Code: [Select]
clen1_7 real8 -0.007090640427900377
clen2_7 real8 0.10428821903870182
clen3_7 real8 -0.6669167678906422
clen4_7 real8 0.5692306871706391

.code

SSE2clen2 proc
    mulsd       xmm0,FLT8(0.31830988618379067153776752674)  ; 1 / PI
    cvtsd2si    eax,xmm0   
    cvtsi2sd    xmm3,eax
    subsd       xmm0,xmm3
    test    eax,1
    jz      hemisphere
    xorpd       xmm0,oword ptr NegDP
hemisphere:
;Polynomial Order 7
    movsd       xmm2,xmm0               ; xmm0 = x
    addsd       xmm2,xmm2               ; xmm2 = xt2;

    movsd       xmm1,real8 ptr clen1_7  ; xmm1 b7 = b_7 = -0.007090640427900377;
    movsd       xmm3,xmm1
    mulsd       xmm3,xmm2               ; xmm3 b6 = b_7 * xt2;

    movsd       xmm4,xmm3
    mulsd       xmm4,xmm2
    addsd       xmm4,real8 ptr clen2_7
    subsd       xmm4,xmm1               ; xmm4 b5 = b_6 * xt2 + 0.10428821903870182 - b_7;

    movsd       xmm1,xmm4
    mulsd       xmm1,xmm2
    subsd       xmm1,xmm3               ; xmm1 b4 = b_5 * xt2 - b_6

    movsd       xmm3,xmm1
    mulsd       xmm3,xmm2
    addsd       xmm3,real8 ptr clen3_7
    subsd       xmm3,xmm4               ; xmm3 b3 = b_4 * xt2 + -0.6669167678906422 - b_5

    movsd       xmm4,xmm3
    mulsd       xmm4,xmm2
    subsd       xmm4,xmm1               ; xmm3 b2 = b_3 * xt2 - b_4;

    movsd       xmm1,xmm4
    mulsd       xmm1,xmm2
    addsd       xmm1,real8 ptr clen4_7
    subsd       xmm1,xmm3               ; xmm1 b1 = b_2 * xt2 + 0.5692306871706391 - b_3;
    mulsd       xmm0,xmm1   
    subsd       xmm0,xmm4               ; xmm0 b0 = x * b_1 - b_2;
    ret

SSE2clen2 endp

And this one is our goldmine i think,
http://lolengine.net/wiki/doc/maths/remez

That guy has C++ source code to calculate Sin, Cos, Tan and Exp coeffs at extreme precision.
Those are of my interest ( for my mathlib ), unfortunatly i don't have a C++ compiler to make use of it.....
Maybe you can calculate the coeffs for us ?

This is where i found the coeffs for the 9th order polynomial.

There is even a faster one with 4 coeffs,

Code: [Select]
double fastsin2(double x)
{
    const double a3 = -1.666665709650470145824129400050267289858e-1;
    const double a5 = 8.333017291562218127986291618761571373087e-3;
    const double a7 = -1.980661520135080504411629636078917643846e-4;
    const double a9 = 2.600054767890361277123254766503271638682e-6;

    return x + x*x*x * (a3 + x*x * (a5 + x*x * (a7 + x*x * a9))));
}


Quote
As I say I'm glad you studied the algo but I think you've made a mistake? I'm using

    pslld xmm2, 31
    pshufd xmm2, xmm2, 73h         ;; convert xmm2 int's to neg masks

and u thought you could improve that by

psllq xmm2, 63   ; put sign-bits in position, to place values in the right hemispheres

on the face of it that seems good but no, it doesn't work? Because the two dwords are packed into the lower 64 bits. If they were in 0..31 and 63..95 it would work but they're not. One of us is missing something?

This is ok,
Code: [Select]
    cvtpd2dq    xmm3,xmm0   ; (2 packed dpfp to 2 even int32) lose the fractional parts and keep it in xmm3 to save the signs
    psllq       xmm3,63     ; put sign-bits in position, to place values in the right hemispheres

cvtpd2dq converts 2 packed doubles to even int32 (bit 0-31 and 64-95)
cvtpd2pi converts 2 packed doubles to 2 int32 (bit 0-31 and 32-63)

Also i noticed something strange with the packed conversion instructions, in certain combinations it's better not to use xmm1 and xmm2. (xmm3 is much faster ????)
Maybe this is an intel issue ?.

When i was looking for more speed and studied different combinations, for example this piece of code below to calculate 1 double.
Try to change xmm3 in xmm1 or xmm2 and you'll notice extreme speed differences ?? (at least on my machine...)

Code: [Select]
SSE2cheby proc
   
    mulsd       xmm0,FLT8(0.31830988618379067153776752674)
    cvtsd2si    eax,xmm0   
    cvtsi2sd    xmm3,eax
    subsd       xmm0,xmm3
    test    eax,1
    jz      hemisphere
    xorpd       xmm0,oword ptr NegDP
hemisphere:
    mulsd       xmm0,FLT8(3.14159265358979323846264338327)

    movsd       xmm2,xmm0
    mulsd       xmm2,xmm2

    movsd       xmm1,real8 ptr cheby3
    mulsd       xmm1,xmm2
    addsd       xmm1,real8 ptr cheby2
    mulsd       xmm1,xmm2
    addsd       xmm1,real8 ptr cheby1
    mulsd       xmm1,xmm2
    addsd       xmm1,real8 ptr cheby0
    mulsd       xmm0,xmm1
    ret

SSE2cheby endp

Marinus
Title: Re: Trigonometry ...
Post by: Gunther on April 15, 2015, 07:11:26 PM
rrr,

When I ask "is it worth it?" however I'm not only thinking of the time improvement, or even the extra work figuring out AVX, I'm mainly worried about maintenance. AVX adds a bunch of complicated instructions; when bugs crop up, they will prob be found there. Also many computers can't use AVX so need to include the CPUID test. In future one must add AVX2 and AVX512 ... it's tempting to just leave it at SSE and figure "good enough" ... but really it's not.

We could go this way: An application has to figure out, which instruction set is available. The rest is done via different code paths. Intel calls this technique CPU dispatching (https://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-understanding-cpu-optimized-code-used-in-intel-ipp). Of course, that can be done with macros, but this isn't real necessary.

Marinus,

And this one is our goldmine i think,
http://lolengine.net/wiki/doc/maths/remez

yes, it is.

On the other hand, have you guys ever thought about CORDIC algorithms (http://en.wikipedia.org/wiki/CORDIC)? It's an old technique (mainframe remainder), but good.

Gunther
Title: Re: Trigonometry ...
Post by: rrr314159 on April 15, 2015, 07:12:45 PM
This is ok,
Code: [Select]
    cvtpd2dq    xmm3,xmm0   ; (2 packed dpfp to 2 even int32) lose the fractional parts and keep it in xmm3 to save the signs
    psllq       xmm3,63     ; put sign-bits in position, to place values in the right hemispheres

cvtpd2dq converts 2 packed doubles to even int32 (bit 0-31 and 64-95)
cvtpd2pi converts 2 packed doubles to 2 int32 (bit 0-31 and 32-63)

- Some misunderstanding. According to my copy of Intel Reference Manual cvtpd2pq packs them in the low 64 bits, while cvtpd2pi uses MMX as source so it's not applicable here. Just run your algo with the same values in the two double precisions in xmm0: 3.0, 3.0. It answers 0.14112, -0.14112 - exactly what I would expect. The first is correct, not the second. Those are the results I get, anyway. Is it possible you've got AVX2 (or some other instruction set) and get different results?

I found those Cheby ref's also. The first, goddard, seemed all wrong; now I see it's using a different algo (Clenshaw) which I didn't notice b4. Since you say it's slower I guess I won't bother with it. The second, looked good but required work! At least we have a few sets of coeff's to use, I'm going to hope to find the rest of them already calc'ed somewhere, b4 going ahead and calc'ing them. Maybe Gunther would enjoy the exercise? :biggrin:

I'll get around to checking out odd behavior with xmm3, let u know what I find ...



Title: Re: Trigonometry ...
Post by: rrr314159 on April 15, 2015, 07:33:57 PM
Hi Gunther,

CPU dispatching is a lot more trouble than just using SSE! I'm already doing primitive dispatching: I just check if the computer has AVX, I ignore more advanced AVX2 etc, and assume he's got at least SSE2. Intel's C compiler Dispatcher is much more sophisticated.

I use macros not for CPU dispatching, but for coding the multiple algo's required. The CPU dispatcher checks CPUID and decides which version of an algo to run on a given machine: AVX, SSE2,3,4 ..., non-SIMD, whatever (more than a dozen possibilities). It's only one piece of code, used once per run, you could use a macro but there's no point. But CPU dispatching entails having multiple versions of the same algo to call: one for AVX, one for SSE, one for old non-SIMD instructions, etc. "Orthogonal macros" (TM ;) ) allow you to write ONE algo for all these. Then via a couple flags, tailor it for AVX, SSE (which flavor), etc. The alternative - which I did originally - is to have 4 (could be more) separate algo's all with exactly the same logic but slightly different instructions. Like, one uses "movups" another uses "vmovupd". It's a horrible duplication of effort, 4 times as much code to check, 4 times the possibility for error, 4 times as many changes to make when maintaining etc. That's what the macros are for, they have nothing to do with Dispatching, rather they handle the decision made by the CPU dispatcher.

CORDIC did not seem applicable, as I remember it was good when we didn't have a fast FPU (in fact, no FPU at all) in the old days but not today. Maybe I'm misremembering?

Title: Re: Trigonometry ...
Post by: Gunther on April 15, 2015, 08:58:11 PM
rrr,

CPU dispatching is a lot more trouble than just using SSE!

of course, no doubt about that.

The alternative - which I did originally - is to have 4 (could be more) separate algo's all with exactly the same logic but slightly different instructions.

that was my idea - different code paths. For example:

CORDIC did not seem applicable, as I remember it was good when we didn't have a fast FPU (in fact, no FPU at all) in the old days but not today. Maybe I'm misremembering?

Yes, at the first glance. But since AVX2 we've also powerful integer instructions. Since CORDIC uses only simple shift, addittion and subtraction it could have a surprising rebirth.

Gunther
Title: Re: Trigonometry ...
Post by: Siekmanski on April 15, 2015, 08:59:57 PM
This is ok,
Code: [Select]
    cvtpd2dq    xmm3,xmm0   ; (2 packed dpfp to 2 even int32) lose the fractional parts and keep it in xmm3 to save the signs
    psllq       xmm3,63     ; put sign-bits in position, to place values in the right hemispheres

cvtpd2dq converts 2 packed doubles to even int32 (bit 0-31 and 64-95)
cvtpd2pi converts 2 packed doubles to 2 int32 (bit 0-31 and 32-63)

- Some misunderstanding. According to my copy of Intel Reference Manual cvtpd2pq packs them in the low 64 bits, while cvtpd2pi uses MMX as source so it's not applicable here. Just run your algo with the same values in the two double precisions in xmm0: 3.0, 3.0. It answers 0.14112, -0.14112 - exactly what I would expect. The first is correct, not the second. Those are the results I get, anyway. Is it possible you've got AVX2 (or some other instruction set) and get different results?

No, you are correct.  :t

I have been mislead by the book of James C.Leiterman "32/64-bit 80x86 Assembly Language Architecture" (use this as my reference book)
Page 167 and 168 must be wrong then.
And ofcourse i tested it with values below 1.56 .... and i didn't noticed it.  :biggrin:
 
Title: Re: Trigonometry ...
Post by: rrr314159 on April 16, 2015, 08:01:28 AM
Quote from: Gunther
that was my idea - different code paths. For example:
1.traditional with the FPU
2.SSE
3.AVX and AVX2

The actual executable software must have different code paths for speed; you can't be making decisions at runtime like "IF SSE then movups, ELSE IF AVX then vmovups" - or whatever - it would slow down to a crawl. But the source code can mix the "code paths" as long as these decisions are made at compile time not run time. That's what my macros do.

BTW I've never seen software with multiple code paths for Single precision vs. Double. Everyone seems to follow the rule "Use single if you can, double if you must" - if you need the extra precision. But I treat single vs. double like AVX vs. SSE (or whatever). I include both code paths and dynamically decide which to use depending on the user's current precision requirements. Single is so much faster, (with SIMD, not FPU), that I find it very worthwhile.

I'll be posting a comprehensive example soon (I hope) to clarify these concepts.

Quote from: Gunther
...since AVX2 we've also powerful integer instructions. Since CORDIC uses only simple shift, addittion and subtraction it could have a surprising rebirth.

Well I'm happy with the speed we've achieved for sin/cos using current approach, but this brings up an idea. All these algos are very floating-point intensive, the integer units are sitting there idle for the most part. I bet if we find something useful to do with integers, by interleaving those instructions with the floating point they'd execute in parallel, almost for free. I always keep those idle integer units in mind: maybe CORDIC, or something, will put them to work.

Quote from: siekmanski
I have been mislead by the book of James C.Leiterman "32/64-bit 80x86 Assembly Language Architecture" (use this as my reference book) Page 167 and 168 must be wrong then.

I don't trust such books never saw one without quite a few mistakes, particularly on difficult points where one most needs help. There's a lot of politics involved, authors often chosen for who they know, not what they know. "Those who can, program; those who can't, write books" Having said that I love such books, after I've studied it myself. Then I always pick up a few valuable points. In fact I hope/expect you will come up with such points from this book: no doubt it has mistakes but also good stuff I've missed.

I recommend "64-ia-32 arcitectures software developers instruction set reference manual" from Intel. It's full of unimportant mistakes, such as missing index references, some spelling mistakes, and horrible instructional material: obviously written by a programmer / techie not a teacher. But there's not one single technical mistake (that I've found)! It's perfectly accurate. Also complete: nothing missing, altho often u have to dig to find it. It's the only book I've used for this topic: pick of the litter, IMHO.
Title: Re: Trigonometry ...
Post by: jj2007 on April 16, 2015, 08:11:57 AM
But the source code can mix the "code paths" as long as these decisions are made at compile time not run time. That's what my macros do.

That implies, however, that you have to provide your customers with different executables by CPU, as code written for modern CPUs will crash on older ones.
Title: Re: Trigonometry ...
Post by: nidud on April 16, 2015, 08:46:55 AM
Quote
libm/ --- libm.so, libm.a
  The math library. Traditionally Unix systems kept stuff like sin(3) and
  cos(3) in a separate library to save space in the days before shared
  libraries.
libdl/ --- libdl.so
  The dynamic linker interface library. This is actually just a bunch of
  stubs that the dynamic linker replaces with pointers to its own
  implementation at runtime.
Title: Re: Trigonometry ...
Post by: rrr314159 on April 16, 2015, 09:58:15 AM
Quote from: jj2007
That implies, however, that you have to provide your customers with different executables by CPU, as code written for modern CPUs will crash on older ones.

- Well that's not what I meant to imply! I have done this successfully, BTW, on a small scale: provided progs to family / friends (not general public). The exe includes all code, FPU up to AVX. Before hitting the advanced code I check CPUID to see if it's supported, if not, I don't allow that code path to be executed. If (for instance) AVX is supported I default to that but still allow selection of "lesser" paths, sometimes there's good reason. For instance AMD's sometimes do the older algo (SSE, even FPU) faster than AVX. That can also happen on older machines where the AVX (or, SSE) implementation does work but it sucks and is very slow. Anyway, if the CPU doesn't support AVX (for instance) it's perfectly OK for the .exe to have those advanced instructions as long as the prog does NOT execute them. This approach has worked fine.

@nidud: I don't really believe in DLL's, don't use them, my .exe links to static library; all code is included all the time. At the most it gets up to 200K (usually much less). Even 15-year old machines have so much memory that this is no problem. I never use any math libraries, everything is written from scratch. Well, sometimes I borrow/steal source code from wherever I can get it, but re-compile it myself. Of course I also use kernel32, user32 as necessary, in the normal way.

My progs only run on Windows, they're all sort of special-purpose (you could call them "math-based games"). For one example, I never use Unicode. So, I just don't run into the problems you do when writing general-purpose stuff for a wide audience.

[edit] reading this over seems I might be sort of exaggerating my experience - I've only had 3 programs used by a few people in the last few months. And all except one of them did it just to make me happy, don't give a **** about my math games! Let's face it, a complete amateur. Bill Gates doesn't have to worry about competition from me anytime soon. Still, I think my progs are pretty cool - that's something.
Title: Re: Trigonometry ...
Post by: Raistlin on April 16, 2015, 03:54:24 PM
I've been reading these posts with complete facination and awe <-- that 's a good thing incase you were wondering.

With a vested interest in what's been discussed I'd like to ask the following questions
for clarrification - please feel free to ignore these if you feel they are stupid:

1) With LUT (look up tables) of varying size - is there a method by which we could
    ensure the CPU identified has the maximum chance of caching the table directly
    so we could reduce the number of main memory accesses. Therefore could we
    programmatically (dynamically) ensure the lookup table beieng generated (precision dependant)
    would fall nicely into CPU cache?
2) When creating real time applications that have an engineering/scientific focus what would be recommend
    precision - without sacrificing cycles - the best-practise balanced approach?
3) Does SSE/FPU use; preclude or have large impacts in speed in theoretical
    parrallel/congurrent environments that you could highlight?

Thanks
R
Title: Re: Trigonometry ...
Post by: rrr314159 on April 17, 2015, 05:05:38 PM
Hello Raistlin,

good to hear you're interested, I get as much satisfaction from sharing my efforts as u do from reading them so it's an even trade

Questions are sensible but others have more experience than me, here's my take on them,

1) I wish we could control cache programatically, or at least a portion of it, but can't; CPU follows its own logic. The way I use a LUT, which is probably same as you and others, it's called within a stream of commands which access other portions of memory such as Video. Those other areas often don't belong in cache because won't be used again, so I would rather retain the LUT; instead it will usually be kicked out. So when the instruction sequence gets back to LUT it often will have to be reloaded. Now, I could make those other memory accesses non-temporal to avoid this but that doesn't necessarily work and is a lot of trouble so I don't. I'm typically accessing sin/cos as much as 250 million per second, would be great if I could do all those together but that's also out of the question. 16384 LUT doesn't fit in my L1 cache anyway, altho it is probably retained in L3 cache (6 MB on my machine). There are a lot of issues - answer is "Yes there are methods by which we can maximize LUT cache usage" but nothing clear or simple, so I don't really try. That's one reason I like the Power Series approach instead. Someone else may have better insight.

2) Engineering / Scientific precision req'ts vary of course depending on many factors but I would say, usually they need more precision than the 7 or so digits I've been concentrating on (I'm doing a graphics app). Scientific wants, I would say, 9; and more is better. Again LUT not 2 satisfactory, need very large table: Chebyshev Power Series are very good in this case. siekmanski posted a set of 5 coeff's for 9 digits; the exact same algos and same techniques we've been posting can be as precise as desired, just find larger sets of 6, 7 .. coeff's. If u don't know how to put them in algo I'll be happy to do it, very straightforward. Gunther works with CERN so his opinion would be valued on this question, I haven't done physics software in many years.

3) SSE and AVX do vector operations, parallel comp of 4 or 8 sin/cos at once. I run these algos on all 4 cores of my machines (Intel i5 and AMD A6), works great - has very large favorable impact! In the old days I left my computer running overnight to accomplish what today takes 1 second! FPU also works fine on multiple cores but it's not vector, only one at a time (altho it does give max possible precision). Not sure if that answers your q .. ?

Thanks again for your interest,
Title: Re: Trigonometry ...
Post by: jj2007 on September 04, 2015, 07:22:25 AM
And this one is our goldmine i think,
http://lolengine.net/wiki/doc/maths/remez

That guy has C++ source code to calculate Sin, Cos, Tan and Exp coeffs at extreme precision.
Those are of my interest ( for my mathlib ), unfortunatly i don't have a C++ compiler to make use of it.....
Maybe you can calculate the coeffs for us ?

This is where i found the coeffs for the 9th order polynomial.

There is even a faster one with 4 coeffs,

Code: [Select]
double fastsin2(double x)
{
    const double a3 = -1.666665709650470145824129400050267289858e-1;
    const double a5 = 8.333017291562218127986291618761571373087e-3;
    const double a7 = -1.980661520135080504411629636078917643846e-4;
    const double a9 = 2.600054767890361277123254766503271638682e-6;

    return x + x*x*x * (a3 + x*x * (a5 + x*x * (a7 + x*x * a9))));
}

Hi Marinus,
Here is a test with the fast algo:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
3313 ms for sin
341 ms for sin2

3175 ms for sin
351 ms for sin2

3207 ms for sin
349 ms for sin2

3157 ms for sin
347 ms for sin2

Roughly a factor 9 on the FPU sinus, not bad :t
Title: Re: Trigonometry ...
Post by: HSE on September 04, 2015, 12:05:41 PM
Code: [Select]
AMD A6-3500 APU with Radeon(tm) HD Graphics
3209 ms for sin
864 ms for sin2

3251 ms for sin
857 ms for sin2

3238 ms for sin
870 ms for sin2

3145 ms for sin
853 ms for sin2

Very interesting, but here only a factor 4.
Title: Re: Trigonometry ...
Post by: Siekmanski on September 11, 2015, 10:17:53 PM
Same algo, 4 single precision values at once  :biggrin:

Code: [Select]
.data
align 16

OneDivPi    real4 4 dup (0.31830988618379067153776752674)
Pi          real4 4 dup (3.14159265358979323846264338327)
ChebyChf40  real4 4 dup (-0.16666657096504701458241294000)
ChebyChf41  real4 4 dup (0.00833301729156221812798629161)
ChebyChf42  real4 4 dup (-0.00019806615201350805044116296)
ChebyChf43  real4 4 dup (0.00000260005476789036127712325)

.code

align 16
SSE2sinSPf proc                             ; Calculate Sine of 1 single or 2,3,4 packed single floating point values ( in: xmm0, result: xmm0 )

    mulps       xmm0,oword ptr OneDivPi     ; 1/pi to get a 1pi range
    cvtps2dq    xmm3,xmm0                   ; (4 packed spfp to 4 packed int32) lose the fractional parts and keep it in xmm3 to save the signs
    cvtdq2ps    xmm1,xmm3                   ; (4 packed int32 to 4 packed spfp) save the  integral parts
    subps       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 to 0.4999999
    pslld       xmm3,31                     ; put sign-bits in position, to place values in the right hemispheres
    xorps       xmm0,xmm3                   ; set sign-bits
    mulps       xmm0,oword ptr Pi           ; restore ranges between -1/2 pi to +1/2 pi

; Now do the Chebyshev approximation of a 9th degree polynomial
; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi

    movaps      xmm2,xmm0
    mulps       xmm2,xmm2

    movaps      xmm1,oword ptr ChebyChf43
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr ChebyChf42
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr ChebyChf41
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr ChebyChf40
    mulps       xmm2,xmm0
    mulps       xmm1,xmm2
    addps       xmm0,xmm1
    ret

SSE2sinSPf endp

And 2 double precision values at once:
 
Code: [Select]
.data

align 16
OneDivPiDP  real8 2 dup (0.31830988618379067153776752674)
PiDP        real8 2 dup (3.14159265358979323846264338327)

ChebyChf0   real8 2 dup (-0.16666657096504701458241294000)
ChebyChf1   real8 2 dup (0.00833301729156221812798629161)
ChebyChf2   real8 2 dup (-0.00019806615201350805044116296)
ChebyChf3   real8 2 dup (0.00000260005476789036127712325)

.code

align 16
SSE2sinDPf proc                             ; Calculate Sine of 1 double or 2 packed double floating point values ( in: xmm0, result: xmm0 )
     
    mulpd       xmm0,oword ptr OneDivPiDP   ; 1/pi to get a 1pi range
    cvtpd2dq    xmm3,xmm0                   ; (2 packed dpfp to 2 int32) lose the fractional parts and keep it in xmm3 to save the signs
    cvtdq2pd    xmm1,xmm3                   ; (2 packed int32 to 2 dpfp) save the  integral parts
    subpd       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 to 0.4999999
    pslld       xmm3,31                     ; put sign-bits in position, to place values in the right hemispheres
    pshufd      xmm3,xmm3,Shuffle(1,3,0,2)  ; shuffle the sign-bits into place         
    xorpd       xmm0,xmm3                   ; set sign-bits
    mulpd       xmm0,oword ptr PiDP         ; restore ranges between -1/2 pi to +1/2 pi

; Now do the Chebyshev approximation of a 9th degree polynomial
; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi

    movapd      xmm2,xmm0
    mulpd       xmm2,xmm2

    movapd      xmm1,oword ptr ChebyChf3
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyChf2
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyChf1
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyChf0
    mulpd       xmm2,xmm0
    mulpd       xmm1,xmm2
    addpd       xmm0,xmm1
    ret
SSE2sinDPf endp