Trigonometry ...

Started by rrr314159, March 26, 2015, 04:48:39 PM


Gunther

Hi rrr,

here are your AVX results. I hope that helps.

Sleeping to stabilize timings...------------------------------------
=========================================
REAL8 sin/cos AVX algorithm
=========================================
BLANK LOOP total is     113

Test Function: trig Sin ========================================

Cycles for 100 Calcs    231

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    252

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007
=========================================
REAL4 sin/cos AVX algorithm
=========================================
BLANK LOOP total is     19

Test Function: trig Sin ========================================

Cycles for 100 Calcs    145

Precision (HIGHER)
average precision       3.96e-007
worst precision         8.34e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    154

Precision (HIGHER)
average precision       4.39e-007
worst precision         9.09e-007
=========================================
REAL8 sin/cos SSE algorithm
=========================================
BLANK LOOP total is     118

Test Function: trig Sin ========================================

Cycles for 100 Calcs    460

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    541

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007
=========================================
REAL4 sin/cos SSE algorithm
=========================================
BLANK LOOP total is     57

Test Function: trig Sin ========================================

Cycles for 100 Calcs    205

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    219

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007


Gunther
You have to know the facts before you can distort them.

TWell

AMD E1-6010, 1.35 GHz

Sleeping to stabilize timings...------------------------------------
=========================================
REAL8 sin/cos AVX algorithm
=========================================
BLANK LOOP total is 106

Test Function: trig Sin ========================================

Cycles for 100 Calcs 987

Precision (HIGHER)
average precision 3.83e-007
worst precision    6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs 977

Precision (HIGHER)
average precision 3.83e-007
worst precision    6.27e-007
=========================================
REAL4 sin/cos AVX algorithm
=========================================
BLANK LOOP total is 36

Test Function: trig Sin ========================================

Cycles for 100 Calcs 444

Precision (HIGHER)
average precision 3.96e-007
worst precision    8.34e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs 444

Precision (HIGHER)
average precision 4.39e-007
worst precision    9.09e-007
=========================================
REAL8 sin/cos SSE algorithm
=========================================
BLANK LOOP total is 213

Test Function: trig Sin ========================================

Cycles for 100 Calcs 1166

Precision (HIGHER)
average precision 3.83e-007
worst precision    6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs 1239

Precision (HIGHER)
average precision 3.83e-007
worst precision    6.27e-007
=========================================
REAL4 sin/cos SSE algorithm
=========================================
BLANK LOOP total is 62

Test Function: trig Sin ========================================

Cycles for 100 Calcs 468

Precision (HIGHER)
average precision 3.81e-007
worst precision    8.79e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs 506

Precision (HIGHER)
average precision 4.01e-007
worst precision    8.05e-007

Siekmanski

Sleeping to stabilize timings...------------------------------------
=========================================
REAL8 sin/cos AVX algorithm
=========================================
BLANK LOOP total is     96

Test Function: trig Sin ========================================

Cycles for 100 Calcs    320

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    364

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007
=========================================
REAL4 sin/cos AVX algorithm
=========================================
BLANK LOOP total is     34

Test Function: trig Sin ========================================

Cycles for 100 Calcs    175

Precision (HIGHER)
average precision       3.96e-007
worst precision         8.34e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    175

Precision (HIGHER)
average precision       4.39e-007
worst precision         9.09e-007
=========================================
REAL8 sin/cos SSE algorithm
=========================================
BLANK LOOP total is     133

Test Function: trig Sin ========================================

Cycles for 100 Calcs    606

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    663

Precision (HIGHER)
average precision       3.83e-007
worst precision         6.27e-007
=========================================
REAL4 sin/cos SSE algorithm
=========================================
BLANK LOOP total is     63

Test Function: trig Sin ========================================

Cycles for 100 Calcs    252

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trig Cos ========================================

Cycles for 100 Calcs    284

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007
Creative coders use backward thinking techniques as a strategy.

rrr314159

Thanks Gunther, TWell, siekmanski, farabi,

Gunther, I can see why you figure AVX is worth it: you get a great improvement over SSE, more than twice as fast in some cases! ... but of course these timings can't be entirely trusted. farabi is right to approve your RDTSCP timing recommendation; also using DOS would help. Anyway these numbers are good enough to give the basic idea: AVX is certainly faster.

When I ask "is it worth it?" however, I'm not only thinking of the time improvement, or even the extra work of figuring out AVX; I'm mainly worried about maintenance. AVX adds a bunch of complicated instructions; when bugs crop up, they will probably be found there. Also, many computers can't use AVX, so one needs to include the CPUID test. In the future one must add AVX2 and AVX512 ... it's tempting to just leave it at SSE and figure "good enough" ... but really it's not.

That's why I made these macros, so I can use one algorithm for all 4 instruction sets. And they will easily expand to AVX512 etc. in the future. I haven't seen anyone else doing this; maybe there's some gotcha I'm missing? I intend to start a new thread concentrating on this "Orthogonal Instruction Set" idea, and hope someone has a better way to do it, or can tell me what's wrong with it. I'm making a new algo (Complex Rational Polynomials) to give these macros a good work-out.

TWell, as usual AMD behaves differently; the AVX improvement is less. My AMD is about the same. Someday I'd like to study AMD peculiarities. They're better than Intel in some ways, worse in others.

siekmanski (referring to your previous post), thanks for examining that algo; it really helps to have someone take a close look at it. I looked on the net for more Chebyshev coefficients but couldn't find them. The ones I have give 5 and 7 digits precision; as it happens, for my current application, 6 is right, so I was happy with these. But the 5 coeffs that give 9 digits are good to have. If you have more, please post them; I'd particularly like to see coefficients for cosine. Higher orders would also be useful (6, 7 coeffs), as well as more precise versions of the 3- and 4-coeff sets I'm using. But oddly enough, more precise coeffs don't seem to help? If you chop your 5-coeff set down to only 7 significant digits, I think it may give about the same precision, 3.3381e-9?

You seem surprised at how good the 5 coeffs are - and yes, they are - but that's characteristic of sin (and cos). Sin is an "odd" function, antisymmetric around 0, so a power series approximation uses only odd powers: x, x^3, x^5, x^7, x^9 etc. So each extra coeff improves the approximation by (on the order of, roughly) 2 digits: 3 cheby coeffs give 5 digits, 4 give 7, and 5 coeffs are up to 9 digits precision. (Whereas if a function is neither odd nor even, each extra coeff adds only one power of x - roughly 1 digit instead of 2; cosine, being even, gains 2 digits per coeff just like sine.) To get all the way to 17 (like fsin) would take 9 coeffs. This is just a naive estimate ignoring various factors, but probably more or less correct. It seems that using this technique we can beat fsin by a large margin at the same precision, which raises a real mystery: why isn't Intel's fsin a lot faster? Of course AMD's is; presumably they're using this cheby power series or something similar. As far as I can see Intel screwed up, but there must be more to the story than that.

I looked into computing cheby coeffs, but if you have a source for them please post it and save a bunch of math. (Which of course is easy if you're on top of it, but it's been 40 years since I was.) BTW I think I may be able to improve on Chebyshev, for our specific purposes, quite a bit ... I'll let you know if so.

As I said, I'm glad you studied the algo, but I think you've made a mistake. I'm using

    pslld xmm2, 31
    pshufd xmm2, xmm2, 73h         ;; convert xmm2 int's to neg masks


and you thought you could improve that by

psllq xmm2, 63   ; put sign-bits in position, to place values in the right hemispheres

On the face of it that seems good, but no, it doesn't work, because the two dwords are packed into the lower 64 bits. If they were in bits 0..31 and 64..95 it would work, but they're not. One of us is missing something?

thanks all,






I am NaN ;)

Siekmanski

Hi rrr,

Quote
I looked on the net for more Chebyshev coefficients but couldn't find them.

I found 2 interesting sites,

Chebyshev polynomial using the Clenshaw algorithm:
http://goddard.net.nz/files/js/chebyshev/cheby.html  ( nice, but not so fast )

tried it out,

clen1_7 real8 -0.007090640427900377
clen2_7 real8 0.10428821903870182
clen3_7 real8 -0.6669167678906422
clen4_7 real8 0.5692306871706391

.code

SSE2clen2 proc
    mulsd       xmm0,FLT8(0.31830988618379067153776752674)  ; 1 / PI
    cvtsd2si    eax,xmm0   
    cvtsi2sd    xmm3,eax
    subsd       xmm0,xmm3
    test    eax,1
    jz      hemisphere
    xorpd       xmm0,oword ptr NegDP
hemisphere:
;Polynomial Order 7
    movsd       xmm2,xmm0               ; xmm0 = x
    addsd       xmm2,xmm2               ; xmm2 = xt2;

    movsd       xmm1,real8 ptr clen1_7  ; xmm1 b7 = b_7 = -0.007090640427900377;
    movsd       xmm3,xmm1
    mulsd       xmm3,xmm2               ; xmm3 b6 = b_7 * xt2;

    movsd       xmm4,xmm3
    mulsd       xmm4,xmm2
    addsd       xmm4,real8 ptr clen2_7
    subsd       xmm4,xmm1               ; xmm4 b5 = b_6 * xt2 + 0.10428821903870182 - b_7;

    movsd       xmm1,xmm4
    mulsd       xmm1,xmm2
    subsd       xmm1,xmm3               ; xmm1 b4 = b_5 * xt2 - b_6

    movsd       xmm3,xmm1
    mulsd       xmm3,xmm2
    addsd       xmm3,real8 ptr clen3_7
    subsd       xmm3,xmm4               ; xmm3 b3 = b_4 * xt2 + -0.6669167678906422 - b_5

    movsd       xmm4,xmm3
    mulsd       xmm4,xmm2
    subsd       xmm4,xmm1               ; xmm4 b2 = b_3 * xt2 - b_4;

    movsd       xmm1,xmm4
    mulsd       xmm1,xmm2
    addsd       xmm1,real8 ptr clen4_7
    subsd       xmm1,xmm3               ; xmm1 b1 = b_2 * xt2 + 0.5692306871706391 - b_3;
    mulsd       xmm0,xmm1   
    subsd       xmm0,xmm4               ; xmm0 b0 = x * b_1 - b_2;
    ret

SSE2clen2 endp


And this one is our goldmine, I think:
http://lolengine.net/wiki/doc/maths/remez

That guy has C++ source code to calculate Sin, Cos, Tan and Exp coeffs at extreme precision.
Those are of interest to me (for my mathlib); unfortunately I don't have a C++ compiler to make use of it.....
Maybe you can calculate the coeffs for us?

This is where I found the coeffs for the 9th-order polynomial.

There is even a faster one, with 4 coeffs:


double fastsin2(double x)
{
    const double a3 = -1.666665709650470145824129400050267289858e-1;
    const double a5 = 8.333017291562218127986291618761571373087e-3;
    const double a7 = -1.980661520135080504411629636078917643846e-4;
    const double a9 = 2.600054767890361277123254766503271638682e-6;

    return x + x*x*x * (a3 + x*x * (a5 + x*x * (a7 + x*x * a9)));
}



Quote
As I said, I'm glad you studied the algo, but I think you've made a mistake. I'm using

    pslld xmm2, 31
    pshufd xmm2, xmm2, 73h         ;; convert xmm2 int's to neg masks

and you thought you could improve that by

psllq xmm2, 63   ; put sign-bits in position, to place values in the right hemispheres

On the face of it that seems good, but no, it doesn't work, because the two dwords are packed into the lower 64 bits. If they were in bits 0..31 and 64..95 it would work, but they're not. One of us is missing something?

This is ok,

    cvtpd2dq    xmm3,xmm0   ; (2 packed dpfp to 2 even int32) lose the fractional parts and keep it in xmm3 to save the signs
    psllq       xmm3,63     ; put sign-bits in position, to place values in the right hemispheres


cvtpd2dq converts 2 packed doubles to even int32 (bit 0-31 and 64-95)
cvtpd2pi converts 2 packed doubles to 2 int32 (bit 0-31 and 32-63)

Also I noticed something strange with the packed conversion instructions: in certain combinations it's better not to use xmm1 and xmm2 (xmm3 is much faster????). Maybe this is an Intel issue?

When I was looking for more speed I studied different combinations, for example the piece of code below, which calculates 1 double. Try changing xmm3 to xmm1 or xmm2 and you'll notice extreme speed differences (at least on my machine...).


SSE2cheby proc
   
    mulsd       xmm0,FLT8(0.31830988618379067153776752674)
    cvtsd2si    eax,xmm0   
    cvtsi2sd    xmm3,eax
    subsd       xmm0,xmm3
    test    eax,1
    jz      hemisphere
    xorpd       xmm0,oword ptr NegDP
hemisphere:
    mulsd       xmm0,FLT8(3.14159265358979323846264338327)

    movsd       xmm2,xmm0
    mulsd       xmm2,xmm2

    movsd       xmm1,real8 ptr cheby3
    mulsd       xmm1,xmm2
    addsd       xmm1,real8 ptr cheby2
    mulsd       xmm1,xmm2
    addsd       xmm1,real8 ptr cheby1
    mulsd       xmm1,xmm2
    addsd       xmm1,real8 ptr cheby0
    mulsd       xmm0,xmm1
    ret

SSE2cheby endp


Marinus
Creative coders use backward thinking techniques as a strategy.

Gunther

rrr,

Quote from: rrr314159 on April 15, 2015, 10:59:12 AM
When I ask "is it worth it?" however I'm not only thinking of the time improvement, or even the extra work figuring out AVX, I'm mainly worried about maintenance. AVX adds a bunch of complicated instructions; when bugs crop up, they will prob be found there. Also many computers can't use AVX so need to include the CPUID test. In future one must add AVX2 and AVX512 ... it's tempting to just leave it at SSE and figure "good enough" ... but really it's not.

We could go this way: an application has to figure out which instruction set is available. The rest is done via different code paths. Intel calls this technique CPU dispatching. Of course, that can be done with macros, but this isn't really necessary.

Marinus,

Quote from: Siekmanski on April 15, 2015, 04:36:41 PM
And this one is our goldmine i think,
http://lolengine.net/wiki/doc/maths/remez

yes, it is.

On the other hand, have you guys ever thought about CORDIC algorithms? It's an old technique (a holdover from the mainframe era), but good.

Gunther
You have to know the facts before you can distort them.

rrr314159

Quote from: Siekmanski on April 15, 2015, 04:36:41 PM
This is ok,

    cvtpd2dq    xmm3,xmm0   ; (2 packed dpfp to 2 even int32) lose the fractional parts and keep it in xmm3 to save the signs
    psllq       xmm3,63     ; put sign-bits in position, to place values in the right hemispheres


cvtpd2dq converts 2 packed doubles to even int32 (bit 0-31 and 64-95)
cvtpd2pi converts 2 packed doubles to 2 int32 (bit 0-31 and 32-63)

- Some misunderstanding. According to my copy of the Intel reference manual, cvtpd2dq packs them in the low 64 bits, while cvtpd2pi uses an MMX register as destination, so it's not applicable here. Just run your algo with the same value in both double-precision slots of xmm0: 3.0, 3.0. It answers 0.14112, -0.14112 - exactly what I would expect. The first is correct, not the second. Those are the results I get, anyway. Is it possible you've got AVX2 (or some other instruction set) and get different results?

I found those Cheby refs also. The first, goddard, seemed all wrong; now I see it's using a different algo (Clenshaw), which I didn't notice before. Since you say it's slower I guess I won't bother with it. The second looked good but requires work! At least we have a few sets of coeffs to use; I'm hoping to find the rest of them already calculated somewhere before going ahead and calculating them myself. Maybe Gunther would enjoy the exercise? :biggrin:

I'll get around to checking out the odd behavior with xmm3 and let you know what I find ...



I am NaN ;)

rrr314159

Hi Gunther,

CPU dispatching is a lot more trouble than just using SSE! I'm already doing primitive dispatching: I just check if the computer has AVX, ignore the more advanced AVX2 etc., and assume it has at least SSE2. Intel's C compiler dispatcher is much more sophisticated.

I use macros not for CPU dispatching, but for coding the multiple algos required. The CPU dispatcher checks CPUID and decides which version of an algo to run on a given machine: AVX, SSE2/3/4 ..., non-SIMD, whatever (more than a dozen possibilities). It's only one piece of code, used once per run; you could use a macro but there's no point. But CPU dispatching entails having multiple versions of the same algo to call: one for AVX, one for SSE, one for old non-SIMD instructions, etc. "Orthogonal macros" (TM ;) ) allow you to write ONE algo for all of these; then, via a couple of flags, tailor it for AVX, SSE (whichever flavor), etc. The alternative - which I did originally - is to have 4 (could be more) separate algos, all with exactly the same logic but slightly different instructions. Like, one uses "movups", another uses "vmovupd". It's a horrible duplication of effort: 4 times as much code to check, 4 times the possibility for error, 4 times as many changes to make when maintaining, etc. That's what the macros are for; they have nothing to do with dispatching, rather they implement the decision made by the CPU dispatcher.

CORDIC did not seem applicable; as I remember, it was good in the old days when we didn't have a fast FPU (in fact, no FPU at all), but not today. Maybe I'm misremembering?

I am NaN ;)

Gunther

rrr,

Quote from: rrr314159 on April 15, 2015, 07:33:57 PM
CPU dispatching is a lot more trouble than just using SSE!

of course, no doubt about that.

Quote from: rrr314159 on April 15, 2015, 07:33:57 PM
The alternative - which I did originally - is to have 4 (could be more) separate algo's all with exactly the same logic but slightly different instructions.

that was my idea - different code paths. For example:

  • traditional with the FPU
  • SSE
  • AVX and AVX2

Quote from: rrr314159 on April 15, 2015, 07:33:57 PM
CORDIC did not seem applicable; as I remember, it was good in the old days when we didn't have a fast FPU (in fact, no FPU at all), but not today. Maybe I'm misremembering?

Yes, at first glance. But since AVX2 we also have powerful integer instructions. Since CORDIC uses only simple shifts, additions and subtractions, it could have a surprising rebirth.

Gunther
You have to know the facts before you can distort them.

Siekmanski

Quote from: rrr314159 on April 15, 2015, 07:12:45 PM
Quote from: Siekmanski on April 15, 2015, 04:36:41 PM
This is ok,

    cvtpd2dq    xmm3,xmm0   ; (2 packed dpfp to 2 even int32) lose the fractional parts and keep it in xmm3 to save the signs
    psllq       xmm3,63     ; put sign-bits in position, to place values in the right hemispheres


cvtpd2dq converts 2 packed doubles to even int32 (bit 0-31 and 64-95)
cvtpd2pi converts 2 packed doubles to 2 int32 (bit 0-31 and 32-63)

- Some misunderstanding. According to my copy of the Intel reference manual, cvtpd2dq packs them in the low 64 bits, while cvtpd2pi uses an MMX register as destination, so it's not applicable here. Just run your algo with the same value in both double-precision slots of xmm0: 3.0, 3.0. It answers 0.14112, -0.14112 - exactly what I would expect. The first is correct, not the second. Those are the results I get, anyway. Is it possible you've got AVX2 (or some other instruction set) and get different results?

No, you are correct.  :t

I have been misled by James C. Leiterman's book "32/64-bit 80x86 Assembly Language Architecture" (I use it as my reference book).
Pages 167 and 168 must be wrong then.
And of course I tested it with values below 1.56 ... and didn't notice it.  :biggrin:
Creative coders use backward thinking techniques as a strategy.

rrr314159

Quote from: Gunther
that was my idea - different code paths. For example:
1. traditional with the FPU
2. SSE
3. AVX and AVX2

The actual executable software must have different code paths for speed; you can't be making decisions at runtime like "IF SSE then movups, ELSE IF AVX then vmovups" - or whatever - it would slow to a crawl. But the source code can mix the "code paths" as long as these decisions are made at compile time, not run time. That's what my macros do.

BTW I've never seen software with multiple code paths for single precision vs. double. Everyone seems to follow the rule "use single if you can, double if you must" - if you need the extra precision. But I treat single vs. double like AVX vs. SSE (or whatever): I include both code paths and dynamically decide which to use depending on the user's current precision requirements. Single is so much faster (with SIMD, not FPU) that I find it very worthwhile.

I'll be posting a comprehensive example soon (I hope) to clarify these concepts.

Quote from: Gunther
...since AVX2 we also have powerful integer instructions. Since CORDIC uses only simple shifts, additions and subtractions, it could have a surprising rebirth.

Well, I'm happy with the speed we've achieved for sin/cos using the current approach, but this brings up an idea. All these algos are very floating-point intensive; the integer units are sitting there idle for the most part. I bet if we find something useful to do with integers, by interleaving those instructions with the floating point they'd execute in parallel, almost for free. I always keep those idle integer units in mind: maybe CORDIC, or something, will put them to work.

Quote from: Siekmanski
I have been misled by James C. Leiterman's book "32/64-bit 80x86 Assembly Language Architecture" (I use it as my reference book). Pages 167 and 168 must be wrong then.

I don't trust such books; I never saw one without quite a few mistakes, particularly on the difficult points where one most needs help. There's a lot of politics involved; authors are often chosen for who they know, not what they know. "Those who can, program; those who can't, write books." Having said that, I love such books - after I've studied the subject myself. Then I always pick up a few valuable points. In fact I hope/expect you will come up with such points from this book: no doubt it has mistakes, but also good stuff I've missed.

I recommend the "Intel 64 and IA-32 Architectures Software Developer's Manual" from Intel. It's full of unimportant mistakes, such as missing index references and some spelling mistakes, and the instructional material is horrible: obviously written by a programmer/techie, not a teacher. But there's not one single technical mistake (that I've found)! It's perfectly accurate. Also complete: nothing missing, although often you have to dig to find it. It's the only book I've used for this topic: pick of the litter, IMHO.
I am NaN ;)

jj2007

Quote from: rrr314159 on April 16, 2015, 08:01:28 AM
But the source code can mix the "code paths" as long as these decisions are made at compile time, not run time. That's what my macros do.

That implies, however, that you have to provide your customers with different executables by CPU, as code written for modern CPUs will crash on older ones.

nidud

#117
deleted

rrr314159

#118
Quote from: jj2007
That implies, however, that you have to provide your customers with different executables by CPU, as code written for modern CPUs will crash on older ones.

- Well, that's not what I meant to imply! I have done this successfully, BTW, on a small scale: provided progs to family/friends (not the general public). The exe includes all code, FPU up to AVX. Before hitting the advanced code I check CPUID to see if it's supported; if not, I don't allow that code path to be executed. If (for instance) AVX is supported I default to that, but still allow selection of "lesser" paths; sometimes there's good reason. For instance, AMDs sometimes do the older algo (SSE, even FPU) faster than AVX. That can also happen on older machines where the AVX (or SSE) implementation works but is very slow. Anyway, if the CPU doesn't support AVX (for instance), it's perfectly OK for the .exe to contain those advanced instructions as long as the prog does NOT execute them. This approach has worked fine.

@nidud: I don't really believe in DLLs; I don't use them. My .exe links to a static library; all code is included all the time. At most it gets up to 200K (usually much less). Even 15-year-old machines have so much memory that this is no problem. I never use any math libraries; everything is written from scratch. Well, sometimes I borrow/steal source code from wherever I can get it, but re-compile it myself. Of course I also use kernel32, user32 as necessary, in the normal way.

My progs only run on Windows; they're all sort of special-purpose (you could call them "math-based games"). For one example, I never use Unicode. So I just don't run into the problems you do when writing general-purpose stuff for a wide audience.

[edit] Reading this over, it seems I might be sort of exaggerating my experience - I've only had 3 programs used by a few people in the last few months. And all except one of them did it just to make me happy; they don't give a **** about my math games! Let's face it: a complete amateur. Bill Gates doesn't have to worry about competition from me anytime soon. Still, I think my progs are pretty cool - that's something.
I am NaN ;)

Raistlin

I've been reading these posts with complete fascination and awe <-- that's a good thing, in case you were wondering.

With a vested interest in what's been discussed, I'd like to ask the following questions for clarification - please feel free to ignore these if you feel they are stupid:

1) With LUTs (lookup tables) of varying size: is there a method by which we could ensure the CPU identified has the maximum chance of caching the table directly, so we could reduce the number of main-memory accesses? That is, could we programmatically (dynamically) ensure the lookup table being generated (precision-dependent) falls nicely into the CPU cache?
2) When creating real-time applications with an engineering/scientific focus, what would be the recommended precision - without sacrificing cycles - i.e. the best-practice balanced approach?
3) Does SSE/FPU use preclude, or have a large impact on speed in, theoretical parallel/concurrent environments that you could highlight?

Thanks
R
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...