Hello Raistlin,
Good to hear you're interested. I get as much satisfaction from sharing my efforts as you do from reading them, so it's an even trade.
Your questions are sensible, but others have more experience than I do. Here's my take on them:
1) I wish we could control the cache programmatically, or at least a portion of it, but we can't; the CPU follows its own logic. The way I use a LUT, which is probably the same as you and others do, it's called within a stream of instructions that access other portions of memory, such as video. Those other areas often don't belong in cache because they won't be used again, so I would rather retain the LUT; instead it will usually be kicked out. So when the instruction sequence gets back to the LUT, it often has to be reloaded. Now, I could make those other memory accesses non-temporal to avoid this, but that doesn't necessarily work and is a lot of trouble, so I don't. I'm typically computing sin/cos as many as 250 million times per second; it would be great if I could do all of those together, but that's also out of the question. The 16384 LUT doesn't fit in my L1 cache anyway, although it is probably retained in L3 cache (6 MB on my machine). There are a lot of issues. The answer is "Yes, there are methods by which we can maximize LUT cache usage", but nothing clear or simple, so I don't really try. That's one reason I like the power series approach instead. Someone else may have better insight.
2) Engineering / scientific precision requirements vary, of course, depending on many factors, but I would say they usually need more precision than the 7 or so digits I've been concentrating on (I'm doing a graphics app). Scientific work wants, I would say, 9 digits, and more is better. Again, a LUT is not too satisfactory here because it needs a very large table; Chebyshev power series are very good in this case. siekmanski posted a set of 5 coefficients for 9 digits; the exact same algos and techniques we've been posting can be as precise as desired, just find larger sets of 6, 7, ... coefficients. If you don't know how to put them in the algo I'll be happy to do it, it's very straightforward. Gunther works with CERN, so his opinion would be valued on this question; I haven't done physics software in many years.
3) SSE and AVX do vector operations: parallel computation of 4 or 8 single-precision sin/cos at once. I run these algos on all 4 cores of my machines (Intel i5 and AMD A6) and it works great, with a very large favorable impact! In the old days I left my computer running overnight to accomplish what today takes 1 second! The FPU also works fine on multiple cores, but it's not vector, only one value at a time (although it does give the max possible precision). Not sure if that answers your question?
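Here is a small sketch of what "4 at once" looks like with SSE intrinsics in C: the same Horner evaluation, but each multiply and add works on four floats in one register. The function name sin4_poly and the coefficient array SIN_C are invented for this example, the coefficients are Taylor stand-ins rather than a fitted set, and no range reduction is done, so inputs are assumed to already lie in about [-pi/2, pi/2]. A scalar fallback is included for compilers that do not target SSE.

```c
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

/* Stand-in Taylor coefficients for x, x^3, ..., x^9 (not a fitted set). */
static const float SIN_C[5] = {
    1.0f, -1.0f / 6.0f, 1.0f / 120.0f, -1.0f / 5040.0f, 1.0f / 362880.0f
};

/* Approximate sin for 4 floats at once: out[i] ~ sin(in[i]). */
static void sin4_poly(const float *in, float *out)
{
#ifdef HAVE_SSE
    __m128 x   = _mm_loadu_ps(in);          /* load 4 inputs */
    __m128 x2  = _mm_mul_ps(x, x);          /* 4 squares in one instruction */
    __m128 acc = _mm_set1_ps(SIN_C[4]);     /* broadcast highest coefficient */
    for (int i = 3; i >= 0; i--)            /* Horner step on all 4 lanes */
        acc = _mm_add_ps(_mm_mul_ps(acc, x2), _mm_set1_ps(SIN_C[i]));
    _mm_storeu_ps(out, _mm_mul_ps(acc, x)); /* multiply by x and store */
#else
    for (int k = 0; k < 4; k++) {           /* scalar fallback, same math */
        float x = in[k], x2 = x * x, acc = SIN_C[4];
        for (int i = 3; i >= 0; i--)
            acc = acc * x2 + SIN_C[i];
        out[k] = acc * x;
    }
#endif
}
```

The AVX version has the same shape with 8 lanes per register, and splitting a large input array into one slice per core stacks on top of this, since each slice is independent.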
Thanks again for your interest,