Trigonometry ...

rrr314159 · March 29, 2015, 04:24:01 AM

It's a challenge; one way or another you'll have to take a speed hit. When you multiply by 65535 /2pi the number easily becomes too large for real4. Normally one would mult by 1/2pi - or, actually, 2/pi, to determine the quadrant. Here's how Jean-Michel Muller does it (slightly edited),

Code Select

mulsd xmm0,sinsd_2_pi    ; mul by 2/pi
        cvttsd2si eax,xmm3           ; convert to real8
        cvtsi2sd xmm1,eax            ; convert to integer
        mov edx,eax                     ; save integer in edx 
        and edx,3                          ; and determine the quadrant (more or less)
        mulsd xmm1,sinsd_cc0       ; mult the integer back up by pi/2
        subsd xmm0,xmm1            ; subtract from original to get remainder, between 0 and pi/2

He has two advantages, 1) using real8 handles much larger numbers, 2) he doesn't really care about speed. This whole procedure costs him maybe 5%, but it would cost you 50%. So what you want is to do it with integers ... but as long as u mult by 65536/2pi and stick with real4, u can't go much over 1000 or so without losing precision.

Well, it's a challenge!

Siekmanski · March 29, 2015, 08:54:20 AM

This is how i handled the larger numbers for a 16384 LUT,

Code Select

SSE_Sin proc
    movss       xmm1,xmm0
    mulss       xmm1,FLT4(0.15915494309189533576888376337)  ; 1 / PI2
    cvttss2si   eax,xmm1    
    cvtsi2ss    xmm1, eax
    mulss       xmm1,FLT4(6.28318530717958647692528676656)  ; PI2
    subss       xmm0, xmm1
    mulss       xmm0,FLT4(10430.2191955273608296137974326)  ; 65535 / PI2 
    cvttss2si   eax,xmm0
    and         eax,65535
    mov         edx,eax
    shr         edx,14
    jmp         [QuadrantJump+edx*4]
SSE_Sin endp

The timings:

Code Select

FPU and SSE Sin Cos routine timings.

Please wait for 5 seconds......

192 cycles SSE_Sin

193 cycles SSE_Cos

Now the FPU sin and cos timings
This may take a while....

7130 cycles fsin
7130 cycles fcos

Press any key to continue...

Edit: new version see Reply #42

jj2007 · March 29, 2015, 12:53:51 PM

Quote from: Siekmanski on March 29, 2015, 08:54:20 AM
This is how i handled the larger numbers for a 16384 LUT

A factor 37 is really impressing :t

Code Select

164 cycles SSE_Sin
160 cycles SSE_Cos

Now the FPU sin and cos timings
This may take a while....

5937 cycles fsin
5950 cycles fcos

btw JWasm doesn't like xorps xmm0, real4 ptr NegSP - and it looks indeed a bit fishy.
ML 9.0 encodes it as

Code Select

004011ED          ³.  0F5705 00204000         xorps xmm0, xmmword ptr [402000]       ; float 5.883850e-39, 5.883822e-39, 5.883806e-39, 0.0

rrr314159 · March 29, 2015, 02:47:33 PM

jj2007,

In this case I believe your suspicions are misplaced :)

Those floats, 5.883850e-39 etc, are the addresses of the labels Quadrant1, Quadrant2 etc, which appear right after the NegSP variable in the proc's LOCAL statement. The float 0.0 is actually 80000000h, the "negative mask", and the statement correctly changes the sign bit of the first real4 in xmm0.

JWasm insists that we write "xmmword ptr" whereas ML helpfully fills it in - the opposite of the usual behavior of the two compilers. It's a bit dangerous, actually. Looks like the statement doesn't touch the upper 3 real4's in xmm0, but it does. Could cause hard-to-find bugs.

In this case, considering the usage of xmm0, everything's copacetic. But if the prog were changed later, could cause probs, so it's good to be aware of it

jj2007 · March 29, 2015, 07:20:20 PM

Thanks, I had no time to dive into the code, and just saw JWasm's complain :redface:

Siekmanski · March 29, 2015, 09:13:41 PM

Hi jj,

Quotebtw JWasm doesn't like xorps xmm0, real4 ptr NegSP - and it looks indeed a bit fishy.

Thought to save 12 bytes, but indeed it may be confusing.
Just change these lines,

align 16
NegSP dd 080000000h,0,0,0

xorps xmm0,oword ptr NegSP

Marinus

Siekmanski · March 30, 2015, 09:05:38 AM

Changed the "fishy real4 ptr NegSP" to "non fishy and JWasm proof" in the source code.

New attachment see Reply #31

Gunther · March 30, 2015, 09:28:16 AM

Marinus,

the new timings from post #31 with i7-3770 CPU, 3.40 GHz running Windows 7-64:

Code Select


FPU and SSE Sin Cos routine timings.

Please wait for 5 seconds......

171 cycles SSE_Sin

174 cycles SSE_Cos

Now the FPU sin and cos timings
This may take a while....

6262 cycles fsin
6262 cycles fcos

Press any key to continue...

Gunther

Siekmanski · March 30, 2015, 09:49:50 AM

Thanks :t

Siekmanski · March 31, 2015, 10:06:15 AM

Reviewed the routines ( correction in the Cosine routine ) and think i've got it as precise as possible now for the real4 LUT approach.
Added a Tangent routine.

Hi, rrr
Just curious, did you find any more speed up tricks without using a LUT and or calculating more at once ?

Marinus

Edit: new version see Reply #42

rrr314159 · March 31, 2015, 02:24:15 PM

Hi siekmanski,

glad u asked, I was just about to post this update ...

Here are my current versions of trig routines. There are 2 of them, Taylor Series and LUT.

T.S. - I used a technique from Jean-Michel Muller, from the SSE routine qWord donated, to improve trigS (the sin-T.S. routine.) It gets up to 11 significant digits; or, in reduced precision mode, gets 5-6 digits 9 times faster than FPU.

BTW I've made a SIMD version of that Muller algo (much modified), sped it up somewhat, but it's still awfully slow. Will have something to show one of these days, for now both these 2 routines are clearly superior, in spite of doing only one input at a time.

The previously posted trigC, based on cos T.S., is still pretty good as is, not re-posted here. This one is prob a bit better.

LUT - siekmanski, your latest version seems as precise as can be done, but I realized why I don't have these problems: I use FPU. Even with real4 data FPU does intermediate calc's to 80 bits so it never loses precision; all the way up to 10^38 gives the same results. So I'm posting my FPU LUT technique, called trigLUT. Using your 16384 entry table, thanks. I suppose it's slower than your lookup routine but for me the extra precision is important. Goes 25 times faster than FPU, not far from 1 nanosecond per number.

One reason for this thread was to find out "what I'm doing wrong". I always use "truncate down to -infinity" rounding mode, compatible with mathematics int function, but everybody else uses the default "round-to-nearest". Although that's much clumsier (to me) I guess it's just as good and has the advantage of being standard. So probably I should convert to that. However haven't done so yet, so these routines still truncate to -inf. A lot easier to understand, isn't it?

Also I should use the standard timers.asm, but again it's not very comfortable to me so I've still got my own routines here. Sorry ... I believe the numbers are reasonably accurate but their main purpose is just relative comparison

Here is the LUT routine:

Code Select

;;*******************************
trigLUT MACRO sincosflag:=<0>   ;; rrr314159 2015/03/30 fpu-based lookup
;;*******************************
    fmul twooverpi              ;; div by pi/2 to put in 1st quadrant
    fld st
    fistp qword ptr [esp-8]     ;; math-like truncate, down towards -inf
    mov eax, dword ptr [esp-8]  ;; eax has (lower word of) int quotient x / pi/2
    fild qword ptr [esp-8]
    fsub                        ;; now mod 1 (0 to .999999) meaning 0-pi/2
    add eax, sincosflag         ;; this converts to cos when sincosflag = 1
    test eax, 1
    jz @F
        fld1
        fsubrp                  ;; replace w/ 1-edx for these quadrants
    @@:
    
    fmul FLT8(16384.0)
    fistp dword ptr [esp-4]     ;; get table index
    mov edx, dword ptr [esp-4]
    fld real4 ptr [sseSinTable+edx*4]

    and eax, 2
    je @F
        fchs                    ;; was in a negative quadrant
    @@:
ENDM
;;*******************************

trigS is longer and messier so please check the zip for that. Here is a typical precision / timing run:

Code Select

FPU average nanos per iter      32.600

=======================================================================
Test Function: trigS ========================================
=======================================================================
********** SIN Using faster version **********
Nanoseconds per Iteration       3.589
Speed ratio FPU/test fn         9.08

Precision
average precision               1.58e-005
worst precision                 1.53e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.584
Speed ratio FPU/test fn         9.1

Precision
average precision               1.57e-005
worst precision                 1.48e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       7.389
Speed ratio FPU/test fn         4.41

Precision
average precision               2.93e-011
worst precision                 4.59e-011

********** COS Using higher precision **********
Nanoseconds per Iteration       7.407
Speed ratio FPU/test fn         4.4

Precision
average precision               2.93e-011
worst precision                 4.58e-011

=======================================================================
Test Function: trigLUT ========================================
=======================================================================
********** SIN Using higher precision **********
Nanoseconds per Iteration       1.295
Speed ratio FPU/test fn         25.2

Precision
average precision               1.52e-005
worst precision                 4.41e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       1.306
Speed ratio FPU/test fn         25

Precision
average precision               1.53e-005
worst precision                 4.77e-005

The zip contains

trig32.asm ; the main routines (called 32 because there's also a 64-bit version, BTW)
test_support_macros ; lots of support macros, could stand some cleaning up
trig32.exe
doj32.bat ; "do JWasm 32-bit" compiling batch file, just for reference

[edit] woops, forget to comment out the line "MY_CPU=1" at top of trig32.asm. Just comment that out. Unless you know your own CPU's FPU sin/cos average speed, in which case put it below at "MY_CPU_NANOS". This saves having to recompute it every time you run the prog.

Gunther · March 31, 2015, 09:44:03 PM

Hi rrr,

here are the timings on my machine under Windows 7-64:

Code Select


FPU average nanos per iter      32.600
=======================================================================
Test Function: trigS ========================================
=======================================================================

********** SIN Using faster version **********
Nanoseconds per Iteration       2.778
Speed ratio FPU/test fn         11.7

Precision
average precision               1.58e-005
worst precision                 1.53e-004

********** COS Using faster version **********
Nanoseconds per Iteration       2.789
Speed ratio FPU/test fn         11.7

Precision
average precision               1.57e-005
worst precision                 1.48e-004

********** SIN Using higher precision **********
Nanoseconds per Iteration       5.981
Speed ratio FPU/test fn         5.45

Precision
average precision               2.93e-011
worst precision                 4.59e-011

********** COS Using higher precision **********
Nanoseconds per Iteration       6.000
Speed ratio FPU/test fn         5.43

Precision
average precision               2.93e-011
worst precision                 4.58e-011
=======================================================================
Test Function: trigLUT ========================================
=======================================================================

********** SIN Using higher precision **********
Nanoseconds per Iteration       0.896
Speed ratio FPU/test fn         36.4

Precision
average precision               1.52e-005
worst precision                 4.41e-005

********** COS Using higher precision **********
Nanoseconds per Iteration       0.926
Speed ratio FPU/test fn         35.2

Precision
average precision               1.53e-005
worst precision                 4.77e-005

Gunther

Siekmanski · March 31, 2015, 10:11:42 PM

Improved the larger number issue.
Made a stupid miscalculation in the tangent routine.
New better version, now with single point and double point precision.

Gunther · March 31, 2015, 10:43:18 PM

Hi Marinus,

thank you for uploading. By the way, here is another stupid false positive by Antivira. What the heck.

Gunther

Siekmanski · March 31, 2015, 11:02:19 PM

Hi, Gunther
You mean the zip from Reply #42 ?

The MASM Forum

News:

Trigonometry ...

rrr314159

Siekmanski

jj2007

rrr314159

jj2007

Siekmanski

Siekmanski

Gunther

Siekmanski

Siekmanski

rrr314159

Gunther

Siekmanski

Gunther

Siekmanski