Thanks Gunther, TWell, siekmanski, farabi,

Gunther, I can see why you figure AVX is worth it: you get a great improvement over SSE, more than twice as fast in some cases! ... but of course these timings can't be entirely trusted. farabi is right to approve your RDTSCP timing recommendation; using DOS would also help. Anyway, these numbers are good enough to give the basic idea: AVX is certainly faster.

When I ask "is it worth it?", however, I'm not only thinking of the time improvement, or even the extra work of figuring out AVX; I'm mainly worried about maintenance. AVX adds a bunch of complicated instructions; when bugs crop up, they will probably be found there. Also, many computers can't use AVX, so you need to include the CPUID test. In future one must add AVX2 and AVX512 ... it's tempting to just leave it at SSE and figure "good enough" ... but really it's not.

That's why I made these macros, so I can use one algorithm for all 4 instruction sets. And they will easily expand to AVX512 etc in the future. I haven't seen anyone else doing this, maybe there's some gotcha I'm missing? I'm intending to start a new thread concentrating on this "Orthogonal Instruction Set" idea and hope someone has a better way to do it, or can tell me what's wrong with it. I'm making a new algo (Complex Rational Polynomials) to give these macros a good work-out.

TWell, as usual AMD behaves differently: the AVX improvement is smaller. My AMD is about the same. Someday I'd like to study AMD peculiarities. They're better than Intel in some ways, worse in others.

siekmanski (referring to your previous post), thanks for examining that algo; it really helps to have someone take a close look at it. I looked on the net for more Chebyshev coefficients but couldn't find them. The ones I have give 5 and 7 digits of precision; as it happens, for my current application, 6 is right, so I was happy with these. But the 5-coeff set that gives 9 digits is good to have. If you have more, please post them; I'd particularly like to see coefficients for Cosine. Higher levels (6, 7 coeffs) would also be useful, as would more precise versions of the 3 and 4 coeffs I'm using. But oddly enough, more precise coeffs don't seem to help? If you chop your 5-coeff set down to only 7 significant digits I think it may give about the same precision, 3.3381e-9?

You seem surprised at how good the 5 coeffs are - and yes, they are - but that's characteristic of sin (and cos). Sin is an "odd" function, antisymmetric around 0, so a power series approximation uses only odd powers: x, x^3, x^5, x^7, x^9 etc. So each extra coeff improves the approximation by roughly 2 digits: 3 cheby coeffs give 5 digits, 4 give 7, and 5 coeffs give up to 9 digits of precision. (Whereas if a function is not odd - or even, like Cosine - each extra coeff only adds one power of x, roughly 1 digit instead of 2.) Getting all the way to 17 (like fsin) would take 9 coeffs. This is just a naive estimate ignoring various factors, but it's probably more or less correct. It seems that with this technique we can beat fsin by a large margin at the same precision, which leaves a real mystery: why isn't Intel's fsin a lot faster? Of course AMD's is; presumably they're using this cheby power series or something similar. As far as I can see Intel screwed up, but there must be more to the story than that.
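To make the odd-powers point concrete, here is a minimal sketch in Python (just for illustration, not the asm). Big caveat: it uses plain Taylor coefficients as stand-ins, NOT the Chebyshev coefficients from this thread, so the errors it reports are worse than the digit counts quoted above. The names `TAYLOR_SIN`, `sin_odd_poly` and `max_error` are made up for this example. It shows the shape of the computation: Horner's rule in x^2, then one final multiply by x.

```python
import math

# Taylor coefficients of sin(x) for the terms x, x^3, x^5, x^7, x^9.
# ASSUMPTION: stand-ins for illustration only; these are NOT the
# Chebyshev coefficients discussed in this thread.
TAYLOR_SIN = [1.0, -1.0 / 6.0, 1.0 / 120.0, -1.0 / 5040.0, 1.0 / 362880.0]


def sin_odd_poly(x, ncoef):
    """Evaluate x*(c0 + c1*x^2 + c2*x^4 + ...) using the first ncoef coeffs."""
    x2 = x * x
    acc = 0.0
    # Horner's rule in x^2: start from the highest coefficient.
    for c in reversed(TAYLOR_SIN[:ncoef]):
        acc = acc * x2 + c
    # One multiply by x turns the even polynomial into an odd one.
    return acc * x


def max_error(ncoef, samples=2001):
    """Largest |approximation - math.sin| sampled over [-pi/2, pi/2]."""
    half_pi = math.pi / 2
    worst = 0.0
    for i in range(samples):
        x = -half_pi + math.pi * i / (samples - 1)
        worst = max(worst, abs(sin_odd_poly(x, ncoef) - math.sin(x)))
    return worst


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "coeffs: max error", max_error(n))
```

Because only x^2 enters the Horner loop, each extra coefficient costs one multiply and one add but raises the degree by two, which is where the "2 digits per coeff" rule of thumb comes from.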

I looked into computing cheby coeffs, but if you have a source for them please post them; it would save a bunch of math. (Which of course is easy if you're on top of it, but it's been 40 years since I was.) BTW I think I may be able to improve on Chebyshev, for our specific purposes, quite a bit ... I'll let you know if so.

As I say I'm glad you studied the algo but I think you've made a mistake? I'm using

`pslld xmm2, 31`

`pshufd xmm2, xmm2, 73h ;; convert xmm2 int's to neg masks`

and you thought you could improve that with

`psllq xmm2, 63 ; put sign-bits in position, to place values in the right hemispheres`

On the face of it that seems good, but no, it doesn't work. The two dwords are packed into the lower 64 bits. If they were in bits 0..31 and 64..95 it would work, but they're not. One of us is missing something?
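To check the lane arithmetic without a debugger, here is a rough Python model of the three instructions, treating an xmm register as a list of four dwords (index 0 = bits 0..31). This is only a sketch: the helper names are invented for the example, and it assumes the upper two dwords of xmm2 are zero on entry, which is my reading of "packed into the lower 64 bits".

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def pslld(dwords, count):
    """Shift each 32-bit lane left by count, like PSLLD."""
    return [(d << count) & MASK32 for d in dwords]


def pshufd(dwords, imm8):
    """Destination lane i takes source lane imm8[2i+1:2i], like PSHUFD."""
    return [dwords[(imm8 >> (2 * i)) & 3] for i in range(4)]


def psllq(dwords, count):
    """Shift each 64-bit lane left by count, like PSLLQ."""
    out = []
    for lo, hi in ((dwords[0], dwords[1]), (dwords[2], dwords[3])):
        q = (((hi << 32) | lo) << count) & MASK64
        out += [q & MASK32, q >> 32]
    return out


def as_qwords(dwords):
    """View the four dwords as two 64-bit lanes (the two doubles)."""
    return [(dwords[1] << 32) | dwords[0], (dwords[3] << 32) | dwords[2]]


def masks_two_step(ints):
    """pslld 31 followed by pshufd 73h."""
    return as_qwords(pshufd(pslld(ints, 31), 0x73))


def masks_psllq(ints):
    """The proposed single psllq 63."""
    return as_qwords(psllq(ints, 63))


SIGN = 1 << 63  # sign bit of a double

if __name__ == "__main__":
    packed = [0, 1, 0, 0]  # ASSUMPTION: ints in dwords 0 and 1, upper half zero
    print([hex(q) for q in masks_two_step(packed)])
    print([hex(q) for q in masks_psllq(packed)])
```

In this model the two-step version puts bit 0 of dword 1 into the sign bit of the second qword, while `psllq 63` takes the second qword's sign from dword 2 instead, so the two only agree when the integers sit in dwords 0 and 2.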

thanks all,