Title: Trigonometry ...
Post by: rrr314159 on March 26, 2015, 04:48:39 PM

These FPU routines replace the fsin and fcos instructions. They're based on the Taylor series for sin and cos. I had been using sin/cos lookup tables (4096 entries), which are a bit faster, but for more precision the LUT gets very large. The main reason I developed these, though, is that I'm translating all my algos to SSE, and would need the gather instruction (AVX2) to use LUTs. Even if I had it on my machine, most people don't.

I have various requirements for trig routines:

- min precision about 4 decimal digits but capable of (much) higher when a flag is set.
- use no more than 4 FPU registers preferably (could be waived if worth it).
- at least 3 times faster than FPU.
- SIMD-izable, which rules out some techniques.

These two routines, trigC and trigS, are based on the cos and sin Taylor series respectively. trigC is faster but requires one FPU reg per Taylor series term (beyond the first "1"). With 4 regs (last term is x^8) it achieves only 5 significant decimal digits. trigS OTOH uses only 4 regs no matter how many terms. This version, at the higher precision setting, uses 6 terms, up to x^11, to achieve almost 9 digits. Both routines calculate both sin and cos, with different accuracy patterns (the trigS pattern is better, but that gets into too much detail - if interested say so).
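For readers who want the series spelled out, here is a minimal C sketch of the two truncated polynomials described above. It is not trigC or trigS: the function names, the use of doubles, and the assumption that x has already been reduced to [-pi/2, pi/2] are all illustrative choices, and nothing here models the FPU register usage.

```c
#include <assert.h>
#include <math.h>

/* Illustrative sketch only; names and the reduced range [-pi/2, pi/2]
   are assumptions, not taken from the original routines. */

/* cos x ~ 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! (four terms beyond the 1) */
static double taylor_cos8(double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double x6 = x4 * x2;
    double x8 = x4 * x4;
    return 1.0 - x2 / 2.0 + x4 / 24.0 - x6 / 720.0 + x8 / 40320.0;
}

/* sin x ~ x - x^3/3! + x^5/5! - ... - x^11/11! (six terms) */
static double taylor_sin11(double x)
{
    double x2 = x * x;
    double term = x;   /* current term, starts at x^1/1! */
    double sum = x;
    for (int k = 3; k <= 11; k += 2) {
        /* next term = -previous * x^2 / (k * (k - 1)) */
        term = -term * x2 / (double)(k * (k - 1));
        sum += term;
    }
    return sum;
}

/* Spot check against exact values at a few well-known angles. */
static void taylor_selfcheck(void)
{
    const double pi = 3.14159265358979323846;
    assert(fabs(taylor_cos8(pi / 3.0) - 0.5) < 1e-4);
    assert(fabs(taylor_cos8(pi / 2.0) - 0.0) < 1e-4);
    assert(fabs(taylor_sin11(pi / 6.0) - 0.5) < 1e-7);
    assert(fabs(taylor_sin11(pi / 2.0) - 1.0) < 1e-7);
}
```

Because both series alternate, the truncation error is bounded by the first omitted term, so it is largest at the ends of whatever reduced range is used.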

On the Intel i5 speed is satisfactory, varying from 4 times faster than FPU to more than 9: only 3 nanoseconds per iteration, with 4 digits precision, enough for some purposes.

However, on the AMD they're unsatisfactory. First, they're 3 times slower than on the Intel - rather worse than usual. But the amazing thing is, the AMD's FPU sin and cos are almost as fast as the Intel's! Since the AMD's clock speed is just a little more than half the Intel's, per cycle AMD is much better.

BTW I've heard it said there's no big difference between Intel and AMD - that's not my experience.

Anyway, I want faster and better routines if possible. These two can be sped up maybe 10% in obvious ways - some of which, no doubt, I'm overlooking - so let me know of any you notice. But I'm also wondering if there are any tricks that would really make a difference. Some ideas:

- I wonder if fixed point would help. I doubt it.
- By factoring the series you can eliminate one multiply, but I don't think it's worth it.
- By manipulating the quadrants flag you can eliminate one branch. It doesn't seem to help though; the extra memory access kills the advantage.
- Mixing the two Taylor series in the obvious way - I'm currently investigating that.
- I did look on the net and saw various libraries, but it's a lot easier (and almost always better) to just write my own than hassle with them. So please don't just give me a link to a library unless you have reason to think it's worth it.
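One possible reading of "factoring the series" is the nested (Horner) form, where the polynomial is evaluated in x^2 from the highest coefficient down, so the individual powers never have to be formed. The C sketch below shows that form for the six-term sin series. It is only an illustration under that assumption: the function name, the double precision, and the reduced range [-pi/2, pi/2] are not from the original routines, and the multiply count of the real code may differ.

```c
#include <assert.h>
#include <math.h>

/* Nested (Horner) form of the sin series up to x^11:
   sin x ~ x * (1 + x2*(c3 + x2*(c5 + x2*(c7 + x2*(c9 + x2*c11)))))
   Hypothetical sketch; assumes x is already reduced to [-pi/2, pi/2]. */
static double sin_horner11(double x)
{
    const double c3  = -1.0 / 6.0;         /* -1/3!  */
    const double c5  =  1.0 / 120.0;       /*  1/5!  */
    const double c7  = -1.0 / 5040.0;      /* -1/7!  */
    const double c9  =  1.0 / 362880.0;    /*  1/9!  */
    const double c11 = -1.0 / 39916800.0;  /* -1/11! */

    double x2 = x * x;
    double p = c11;
    p = c9  + x2 * p;
    p = c7  + x2 * p;
    p = c5  + x2 * p;
    p = c3  + x2 * p;
    p = 1.0 + x2 * p;
    return x * p;   /* one multiply per coefficient, plus x*x and the final x*p */
}

/* Spot check against exact values at well-known angles. */
static void horner_selfcheck(void)
{
    const double pi = 3.14159265358979323846;
    assert(fabs(sin_horner11(pi / 6.0) - 0.5) < 1e-9);
    assert(fabs(sin_horner11(pi / 2.0) - 1.0) < 1e-7);
}
```

Each step depends on the previous one, so the nested form trades fewer multiplies for a longer dependency chain; whether that is a net win would have to be measured in the test bed.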

If you have a candidate routine use my test bed to get stats, or give it to me and I'll do it.

Here is the fastest version I have at the moment, called trigC:

The MASM Forum

General => The Laboratory => Topic started by: rrr314159 on March 26, 2015, 04:48:39 PM