
Trigonometry ...

Started by rrr314159, March 26, 2015, 04:48:39 PM


Siekmanski

Here are the timings of the routines.

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
Sin DP:  0.38260039 ( SSE2 Double point )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
Cos DP:  0.92384972 ( SSE2 Double point )

ftan:    0.41421356 ( FPU )
Tan SP:  0.41410828 ( SSE2 Single point )
Tan DP:  0.41410826 ( SSE2 Double point )

Start timing, please wait for 5 seconds......

183  cycles SSE2_SinSP
186  cycles SSE2_SinDP
7163 cycles fsin (FPU)

194  cycles SSE2_CosSP
197  cycles SSE2_CosDP
7163 cycles fcos (FPU)

191  cycles SSE2_TanSP
191  cycles SSE2_TanDP
7385 cycles ftan (FPU)
Creative coders use backward thinking techniques as a strategy.

Gunther

Marinus,

Quote from: Siekmanski on March 31, 2015, 11:02:19 PM
You mean the zip from Reply #42 ?

Yes, the EXE from post #42.

Gunther
You have to know the facts before you can distort them.

Siekmanski

nothing to kill in that zip  :biggrin:

rrr314159

@Gunther, thanks for running my test. I'm afraid the timing numbers aren't too useful, being nonstandard; that's why I decided to borrow Siekmanski's timer to get a valid comparison.

hello siekmanski,

I hate to say this ...

I used your timing routine to check out my FPU version of your LUT, and it comes out almost twice as fast, with an extra half digit of accuracy. I also checked my Taylor-series routine, which is also faster, with at least an extra digit of precision! That really surprised me; it seems this thread was a success. The Taylor series takes less than twice the FPU LUT time, much better than I expected.

Note, my routines are macros while yours are procs, so mine save a few cycles by not making the CALL.

I didn't set the rounding mode to truncate, so as not to disturb the test environment. My routines give the same results for positive numbers (as used here), but would be off for negative ones. I did check them with that mode; everything comes out exactly the same, since your routines are also unaffected for positive numbers. The truncate routine is included, but the call is commented out.

It makes sense that the FPU would be faster, and more accurate, for various reasons, but the bottom line seems to be (as I've mentioned many times) that SSE is disappointing. It should probably only be used when SIMD capability gives a real advantage.

The above applies to Intel. For AMD the FPU LUT is still almost twice as fast as the SSE LUT, but the rest of the story is different: AMD's fsin and fcos are almost 10 times faster than Intel's! The Taylor series, though, comes out a bit slower than the SSE LUT.

Here are my precision/timing runs:

Intel i5 3330 3 Ghz

SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

174  cycles SSE2_SinSP
77  cycles trigLUT Sin
144  cycles trigS Sin
6714 cycles fsin (FPU)

181  cycles SSE2_CosSP
91  cycles trigLUT Cos
160  cycles trigS Cos
6697 cycles fcos (FPU)

Press any key to continue...

*********************************************
AMD A6 1.8 Ghz

SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

214  cycles SSE2_SinSP
110  cycles trigLUT Sin
265  cycles trigS Sin
874 cycles fsin (FPU)

221  cycles SSE2_CosSP
131  cycles trigLUT Cos
287  cycles trigS Cos
906 cycles fcos (FPU)

Press any key to continue...


The zip contains your prog, slightly modified to include my trigLUT routine; I also stripped out all the Tan and double-precision code.
I am NaN ;)

Gunther

Marinus,

here are the new timings:

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
Sin DP:  0.38260039 ( SSE2 Double point )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
Cos DP:  0.92384972 ( SSE2 Double point )

ftan:    0.41421356 ( FPU )
Tan SP:  0.41410828 ( SSE2 Single point )
Tan DP:  0.41410826 ( SSE2 Double point )

Start timing, please wait for 5 seconds......

167  cycles SSE2_SinSP
168  cycles SSE2_SinDP
6279 cycles fsin (FPU)

170  cycles SSE2_CosSP
173  cycles SSE2_CosDP
6280 cycles fcos (FPU)

168  cycles SSE2_TanSP
168  cycles SSE2_TanDP
6468 cycles ftan (FPU)


Press any key to continue...


and the following are for rrr:

SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

168  cycles SSE2_SinSP
75  cycles trigLUT Sin
138  cycles trigS Sin
6390 cycles fsin (FPU)

174  cycles SSE2_CosSP
88  cycles trigLUT Cos
154  cycles trigS Cos
6395 cycles fcos (FPU)

Press any key to continue...


Gunther

Siekmanski

Thanks Gunther.

Wow ! rrr314159, this is really impressive !  :dazzled:
I'll study your source code tomorrow; this thread is indeed a success.  :t

SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

184  cycles SSE2_SinSP
76  cycles trigLUT Sin
154  cycles trigS Sin
7130 cycles fsin (FPU)

193  cycles SSE2_CosSP
97  cycles trigLUT Cos
177  cycles trigS Cos
7138 cycles fcos (FPU)


Press any key to continue...

dedndave

you guys are making me feel old   :lol:

P4 prescott w/htt @ 3.0 GHz
SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    0.38268344 ( FPU )
Sin SP:  0.38260040 ( SSE2 Single point )
rrrLUT:  0.38268897 ( trigLUT Sin )
Taylor:  0.38268344 ( trigS Sin )

fcos:    0.92387953 ( FPU )
Cos SP:  0.92384970 ( SSE2 Single point )
rrrLUT:  0.92388642 ( trigLUT Cos )
Taylor:  0.92386763 ( trigS Cos )

Start timing, please wait for 5 seconds......

580  cycles SSE2_SinSP
404  cycles trigLUT Sin
621  cycles trigS Sin
22774 cycles fsin (FPU)

673  cycles SSE2_CosSP
431  cycles trigLUT Cos
668  cycles trigS Cos
22701 cycles fcos (FPU)


very nice work, guys   :t

Gunther

Quote from: dedndave on April 01, 2015, 12:50:10 PM
very nice work, guys   :t

Yes, indeed.  :t My compliments.

Gunther

nidud

#53
deleted

rrr314159

Looks like it would improve performance on other computers; I'll give it a try in a while, but some AMDs are strange. Notice the FPU time is almost the same as the other routines, but on Intel it's 40 times greater! My AMD A6 is similar. But some AMDs, like sinsi's (I think more modern / powerful), behave like Intel.

jj2007

Just found this by accident here:

Quote
First, the Taylor series is NOT the best/fastest way to implement sine/cos. It is also not the way professional libraries implement these trigonometric functions
...
Check the GSL (GNU Scientific Library) or Numerical Recipes implementations, and you will see that they basically use the Chebyshev approximation formula. Chebyshev approximation converges faster, so you need to evaluate fewer terms.

rrr314159

@Gunther, dedndave, thanks for the congrats, but we're not done yet ... the main goal was a fast SSE power-series routine. It's coming along; I hope to post something soon ...

@jj2007, Chebyshev might be a good idea. If you look at the OP you'll see I required a range of precisions and figured Chebyshev was no good, because the coefficients change depending on the desired precision (unless you want to calculate the orthogonal polynomials, which takes too long). But now that you mention it, we can just have entirely different series for different precisions - in the old days we couldn't afford to waste the space. I also remembered the gain as minimal, and the region of convergence as problematical, but according to the post you reference it's substantial over our region of interest. Well, it's easy enough to check - just change 3 or 4 constants. I'll let you know how it works out. Thanks!

Doesn't anyone else think it strange that some AMDs have an fcos and fsin that are 10 times faster (in cycles) than Intel's? Why is that? Perhaps they use a lot of transistors that Intel didn't think worthwhile ...? If they were all like nidud's (and my) AMD, it would probably be better to just use the built-in functions.

rrr314159

It took a minute to find out - yes, Chebyshev is worth it, especially for medium precision (6-7 digits), where it gets at least 1 extra digit of precision. It takes a little longer, for a currently unknown reason. For higher precisions it's more trouble; the zone of convergence shrinks rapidly. So for an adjustable-precision algo it's a bit of a PITA, but worth it. Thanks jj! Aren't you glad you're not wasting your time in this thread? In math, answers are exact and unarguable (for the most part), which makes it much more satisfying than text, where you have to deal with caches, bibles, opinions and who knows what. Now find a way to improve the LUT :biggrin:

I'll post an update with these better coefficients when I get the SSE SIMD algo working, which should be soon.

Quote from: nidud
As rrr pointed out, it may be a good idea to either convert all algos to macros, or (maybe better) convert all to procs.

- I converted them all to macros because it's faster; admittedly procs are more portable ... well, at least now we're comparing apples to apples.

Siekmanski

Hi rrr314159,
I studied your mega fast LUT routine.
Maybe you noticed already that your LUT routine sometimes crashes or gives wrong results.

Hi nidud,
The reason I used a jump table was to level out the time differences between the 4 quadrant calculations. But with only 4 entries it doesn't pay off, so I removed the jump table.

I improved the routine a little; it's more accurate now.

SSE2 Sin Cos routines by Siekmanski 2015.
Also with LUT and Taylor Series Sin Cos by rrr314159 2015.

fsin:    -0.93832556 ( FPU )
Sin SP:  -0.93831056 ( SSE2 Single point )
rrrLUT:  -0.93834370 ( trigLUT Sin )
Taylor:  -0.93830954 ( trigS Sin )

fcos:    0.34575302 ( FPU )
Cos SP:  0.34572631 ( SSE2 Single point )
rrrLUT:  0.34572631 ( trigLUT Cos )
Taylor:  0.34575302 ( trigS Cos )

Start timing, please wait for 5 seconds......

174  cycles SSE2_SinSP
97  cycles trigLUT Sin
186  cycles trigS Sin
7134 cycles fsin (FPU)

180  cycles SSE2_CosSP
86  cycles trigLUT Cos
170  cycles trigS Cos
7133 cycles fcos (FPU)



rrr314159

Quote from: siekmanskiMaybe you noticed already that your LUT routine crashes sometimes or gives wrong results.

... It's the problem I've mentioned often: all my routines use "truncate down to -inf" rounding mode, but everybody else uses the default rounding mode, and that's what you're doing here. So the Taylor series should also give wrong results for negative numbers, and for some other borderline cases where it rounds in the wrong direction. But it doesn't crash, because it's not accessing memory incorrectly. The LUT can try to access table entry "16385", which is why an extra entry at the top (1.0) is safer. If you comment IN the call to initFPUsettruncatemode it may fix it entirely. I'll check it out.

I've been putting off this rounding-mode issue, but I have to deal with it. "Math-style" rounding is better (at least one less instruction needed to determine the quadrant, and it's more intuitive), but it's non-standard, and I have to bite the bullet and convert to the default mode. No big deal, but I hate to do it, since all my algos (thousands of lines) use the math int function. I intend to continue using it for my own work (unless I find out it's not, as I believe, better), so I'll need 2 versions forever ... a PITA. But I can't expect the rest of the world to conform to me!

Meanwhile I've made a new test bed which feeds the algos SSE registers instead of the FPU (for the SIMD algo), and naturally your LUT does better there, since it doesn't need the conversion it does here. It's still not as fast on Intel, but on my AMD it's about the same. ... And the FPU sin and cos are 10X faster than Intel! All these routines would be more or less redundant if everyone had AMDs like mine and nidud's.