News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Trigonometry ...

Started by rrr314159, March 26, 2015, 04:48:39 PM

Previous topic - Next topic

Antariy

Every machine has QueryPerformanceCounter, but the problem is some OSes return the processor frequency with the call to the QueryPerformanceFrequency, and some return not it but some other value, with smaller than TSC resolution.

dedndave

if he "beats" QueryPerformanceCounter against QueryPerformanceFrequency, it should work correctly either way   :biggrin:

Antariy

Quote from: dedndave on March 27, 2015, 05:39:33 AM
if he "beats" QueryPerformanceCounter against QueryPerformanceFrequency, it should work correctly either way   :biggrin:

That's right, just noted that because once has a faulty cycles counting measure due to Win7 provides not cpu freq but rather some fixed freq.

Siekmanski

#18
Quotebut - it only looks up one value at a time, right?

Yes, I never needed to calculate more than one at the time.
You need a lot of shuffles to do 4 or 8 at once.
And if so, I precalculate those.

Here's a new one with the useless "sub" removed in part3:
And routines using a Jump table. Don't now if it's faster, haven't tested it yet ( maybe tomorrow )

Edit: new version see Reply #42
Creative coders use backward thinking techniques as a strategy.

qWord

#19
rrr314159,

the precision significantly decrease for negative values. Also, you did know that FISTTP is an SSE3 instruction?

Just for compare, below an SSE2 implementation for sin(x) (also TP).  Due to the needed truncation the domain is approx. -3.37e9 to +3.37e9. The relative error is below 10-11.
    OPTION PROLOGUE:none
    OPTION EPILOGUE:none
    align 16
    sinsd proc
       
        .const
            align 16
            ; head: pi/2  ( hx = Round(mantissa(pi/2)*2^23 )/2^23 )
            sinsd_cc0       REAL8 1.57080078125
            ; tail: pi/2  ( tx = pi /2 - hx )
            sinsd_cc1       REAl8 -4.45445510338076867830836E-6
           
            ; 2/pi
            sinsd_2_pi  REAL8  0.6366197723675813430755351  ; 2/pi
   
            sinsd_qdrnt REAl8  0.0
                        REAl8  1.5707963267948966192313217
                        REAl8  0.0
                        REAl8 -1.5707963267948966192313217
           
            sinsd_c     REAL8 -7.647163731819816475901132E-13
                        REAL8 1.6059043836821614599392377E-10
                        REAL8 -2.50521083854417187750521E-8
                        REAL8 2.75573192239858906525573E-6
                        REAL8 -0.19841269841269841269841E-3
                        REAL8 8.333333333333333333333333E-3
                        REAL8 -0.16666666666666666666667
                        REAL8 1.0
        .code
       
       
        ; k = Floor(x*2/Pi)
        ; xr = (x-C0*k) - C1*k
        ; REF: Jean-Michel Muller, "Elementary Functions, Algorithms and Implementation", Second edition
        movsd xmm3,xmm0
        mulsd xmm3,sinsd_2_pi
        cvttsd2si eax,xmm3
        cvtsi2sd xmm1,eax
        mov edx,eax
        and edx,3
        movsd xmm2,xmm1
        mulsd xmm1,sinsd_cc0
        mulsd xmm2,sinsd_cc1
        subsd xmm0,xmm1
        subsd xmm0,xmm2
       
        jz @calc
        movsd xmm1,xmm0
        movsd xmm0,sinsd_qdrnt[edx*REAL8]
   
        and eax,80000003h
        js @s
        jnp @np
        jp @p
    @s: test eax,2
        jz @p
    @np:    subsd xmm0,xmm1
        jmp @calc
    @p:     addsd xmm0,xmm1
    @calc:
   
        movsd xmm2,xmm0
        mulsd xmm2,xmm2
        movsd xmm3,sinsd_c[0*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[1*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[2*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[3*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[4*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[5*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[6*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[7*REAl8]
        mulsd xmm0,xmm3
   
        ret
   
    sinsd endp


regards
MREAL macros - when you need floating point arithmetic while assembling!

daydreamer

I have old code that creates+rotates etc+x/z,y/z cheating with help of SSE reciprocal instructions, two Points at a time, with help of a sincossincos 256 entry LUT
if anone is interested I dig up code

my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

Siekmanski

Creative coders use backward thinking techniques as a strategy.

Siekmanski

#22
Here are all the routines and the timing results.
I have included a jump table routine.

for the sources see Reply #42
Creative coders use backward thinking techniques as a strategy.

MichaelW

You can partially compensate for the lower frequency of the performance counters, relative to the TSC, by synchronizing with the counter at the start of the timed period, effectively reducing the uncertainty from two counter cycles to one. Given the variable clock speed for recent processors, and assuming that there is no reliable method of forcing the processor to run at a fixed frequency, for real-time timings the performance counters are all we have to work with.
Well Microsoft, here's another nice mess you've gotten us into.

rrr314159

#24
Hi all,

@qword, thanks for catching that, it's not supposed to be fisttp but fistp. Works the same for positive numbers but for negative it needs fistp with truncate (down to -inf) rounding mode, then the precision remains unchanged. I'll post a fix soon. I'll also take a look at your SSE algo ... the precision at 10^-11 is much better than I need, but more never hurts.

@dedndave, you're suggesting (I think) that you and nidud had considerably longer timings, so somewhere a multiplication that worked OK on my machine is overflowing into the high word on yours - which I'm neglecting to check. Makes sense, if so be easy to fix. I'll take a look ...

@MichaelW, rdtsc works fine on my machines, even with variable clock - I get the exact same numbers as with perf counter (but with 3 extra digits of course). Perhaps I need to check it for longer times, I rarely go above 10 seconds or so - or just look more carefully ;) It's often important to get nanosecond timing, so I can't abandon rdtsc. Therefore I've made two versions of timing routines, using each. For this project, as it turned out I could even use your timers.asm; as posted, I'm only timing one big loop. When I started however I was timing many short segments, doing lots of nanosecond calculations, and not looping; so needed my routines, and just left them in final version. If I get around to it I'll replace them.

@siekmanski, here's what I wrote b4 reading recent posts,

Your LUT technique is better than what I was doing in a couple of ways, checking for quadrants is simpler and faster, so I'll borrow that - thanks! [edit] - I see you've posted an "even better" jump table version, I'll take a look ...

When fed numbers in the right range it's about twice as fast as the Taylor Series routine at about the same precision. But that doesn't change the conclusion that, without vgather instruction, the T.S. should be better for SIMD. It should be able to do 4 numbers at once (or, 8 with AVX) and therefore still be twice (or more) as fast as the LUT - I hope. I'll let u know when finished. [edit] qword's posting may be just what I need ...

However you do have a couple bugs - which probably never happen the way you use the routine. First, gives wrong answer at exactly pi/2 or 3pi/2. This can be fixed two ways. First, use truncate mode instead of the usual rounding mode (probably that's what you do). With normal rounding mode, it will round up to the 16,385th entry. So you can also put one more entry at the top of the table to catch it - gives exact right answer, 1, and only costs 1 byte. Both these fixes are shown below, FWIW.

Second, it doesn't handle larger numbers right. At 10,000 the answers are less precise by almost two digits; at 100,000 loses another digit; and fails entirely somewhere between there and 1,000,000. Basic reason (I figure), you convert real number to integer before truncating down to the first quadrant. If you reduce to the first quadrant using FPU, and only then convert to integer, it avoids this problem; but that's a lot slower. I have to handle larger numbers all the time, so do it that way; probably you don't (or would have noticed this problem). Not sure best way to fix it (or even if u need to fix it), so I left it alone.

Here's the slightly modified LUT as I tested it:

; SSE Sin Cos routine by Siekmanski 2010.
include \masm32\include\masm32rt.inc
     .686p
     .xmm

; minor mods rrr314159 2015/03/25

FLT4 MACRO float_number:REQ
LOCAL float_num
    .data
     align 4
     float_num real4 float_number
    .code
    EXITM <float_num>
ENDM

.data?
align 16
sseSinTable real4 1 + 65536/4 dup (?)          ; rrr, add one more entry at top

temp4 real4 ?                                  ; rrr, for going from FPU to xmm

.data
align 16
NegSP dd 080000000h,000000000h,000000000h,000000000h

.code

align 4
InitSinTable proc public uses edi
LOCAL PI2:real4
LOCAL SinCounter:real4

    finit
    fclex
   
;    call initFPUsettruncatemode                ; rrr, set truncate mode or else add one more entry at top
                                                ; I added the entry because it's a little more accurate
fld FLT4(0.0)
fstp SinCounter
fldpi
fadd st(0),st(0)
fstp PI2
lea edi,sseSinTable
mov ecx,65536/4
sin_lp:
fld SinCounter
fdiv FLT4(65536.0)
fmul PI2
fsin
fstp real4 ptr[edi]
fld SinCounter
fadd FLT4(1.0)
fstp SinCounter
add edi,4
dec ecx
; jnz sin_lp
jge sin_lp                    ; rrr, jge instead of jnz to add entry at top
ret

InitSinTable endp


align 16
SSE_Sin proc ; in: xmm0, out: xmm0 with sin value

; modified rrr314159 2015/03/25 to use st(0) in and out of xmm0

fstp temp4
movss xmm0, temp4

mulss xmm0,FLT4(10430.3783504704527249495663163) ; 65536 / PI2
cvttss2si eax,xmm0

and eax,65535
cmp eax,16384-1
ja Part2
movss xmm0,real4 ptr [sseSinTable+eax*4]
jmp rrrret
Part2:
cmp eax,16384*2-1
ja Part3
mov edx,16384*2
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
jmp rrrret
Part3:
cmp eax,16384*3-1
ja Part4
sub eax,16384*2
movss xmm0,real4 ptr [sseSinTable+eax*4]
xorps xmm0,oword ptr NegSP
jmp rrrret
Part4:
mov edx,16384*4
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
xorps xmm0,oword ptr NegSP

rrrret:
    movss temp4, xmm0
    fld temp4
    ret

SSE_Sin endp


align 16
SSE_Cos proc ; in: xmm0, out: xmm0 with cos value

; modified rrr314159 2015/03/25 to use st(0) in and out of xmm0

fstp temp4
movss xmm0, temp4

mulss xmm0,FLT4(10430.3783504704527249495663163) ; 65536 / PI2
cvttss2si eax,xmm0
add eax,16384
and eax,65535
cmp eax,16384-1
ja Part2
movss xmm0,real4 ptr [sseSinTable+eax*4]
jmp rrrret
Part2:
cmp eax,16384*2-1
ja Part3
mov edx,16384*2
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
jmp rrrret
Part3:
cmp eax,16384*3-1
ja Part4
sub eax,16384*2
movss xmm0,real4 ptr [sseSinTable+eax*4]
xorps xmm0,oword ptr NegSP
jmp rrrret
Part4:
mov edx,16384*4
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
xorps xmm0,oword ptr NegSP

rrrret:
    movss temp4, xmm0
    fld temp4
    ret

SSE_Cos endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data?
    thecontrolword dw ?
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
initFPUsettruncatemode:
    finit
    fstcw thecontrolword
    fwait
    mov   ax,thecontrolword
    or    ax,0400h  ;this will set both bits of the RC field to the truncating mode
                    ; rounding down to -infinity as math INT function

    mov thecontrolword, ax
    fldcw thecontrolword     ;load the modified Control Word
ret

end


Here are a few test runs to illustrate the LUT's 2X higher speed, relative to T.S., and also the large-number problem:

WITH NUMBERS NEAR 0.0

    ----------------------------------------
FPU fsin nanos per iter         31.842
FPU fcos nanos per iter         31.906

Test Function: SIEKMANSKI========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.846
Speed ratio FPU/test fn         17.3

Precision
average precision               3.09e-005
worst precision                 9.25e-005 ; GOOD PRECISION

********** COS Using faster version **********
Nanoseconds per Iteration       1.822
Speed ratio FPU/test fn         17.5

Precision
average precision               3.09e-005
worst precision                 9.34e-005

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.562
Speed ratio FPU/test fn         8.95

Precision
average precision               9.98e-005
worst precision                 8.27e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.317
Speed ratio FPU/test fn         9.61

Precision
average precision               1.00e-004
worst precision                 8.94e-004

**************************************************************
WITH NUMBERS NEAR 10000.0

    ----------------------------------------
FPU fsin nanos per iter         31.873
FPU fcos nanos per iter         31.955

Test Function: SIEKMANSKI========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.808
Speed ratio FPU/test fn         17.7

Precision
average precision               5.46e-004
worst precision                 1.42e-003

********** COS Using faster version **********
Nanoseconds per Iteration       1.810
Speed ratio FPU/test fn         17.6

Precision
average precision               5.45e-004
worst precision                 1.36e-003 ; LOSING PRECISION

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.548
Speed ratio FPU/test fn         9

Precision
average precision               1.00e-004
worst precision                 8.92e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.555
Speed ratio FPU/test fn         8.98

Precision
average precision               9.98e-005
worst precision                 8.28e-004

**************************************************************
WITH NUMBERS NEAR 100000.0

    ----------------------------------------
FPU fsin nanos per iter         31.800
FPU fcos nanos per iter         31.895

Test Function: SIEKMANSKI========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.806
Speed ratio FPU/test fn         17.6

Precision
average precision               5.33e-003
worst precision                 1.42e-002 ; LOSING PRECISION

********** COS Using faster version **********
Nanoseconds per Iteration       1.739
Speed ratio FPU/test fn         18.3

Precision
average precision               5.16e-003
worst precision                 1.41e-002

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.518
Speed ratio FPU/test fn         9.05

Precision
average precision               9.99e-005
worst precision                 8.74e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.536
Speed ratio FPU/test fn         9.01

Precision
average precision               9.98e-005
worst precision                 8.46e-004


I am NaN ;)

dedndave

Quote from: rrr314159 on March 27, 2015, 09:17:17 PM
@dedndave, you're suggesting (I think) that you and nidud had considerably longer timings...

no - our older processors seem to be faster than all the new ones   :biggrin:
i took a quick look and noticed you are using the FPU for calculations
so, my theory about ignoring a high dword probably isn't the case

daydreamer

I read that pixelshader is capable of have 1 cycle lookup of sin and cos,anyone capable of writing pixelshader that outputs LUT to system memory?
its earlier ddraw that doesnt work on newer OS,win xp tops ,actually should be wieved in 8bit color palette, sorry cant find the latest source for it
this is very early version I found
initcircle PROC
LOCAL @@stp :REAL4
LOCAL @@cnt :REAL4
    finit
    fldpi
    fidiv degrees
    fstp @@stp
    fldz
    fstp @@cnt
    mov eax,0
    mov ecx,[degrees]
    add ecx,ecx
    add ecx,ecx
@l1:fld @@cnt
    fsincos
    fstp REAL4 PTR [eax*8+sintab]
    fstp REAL4 PTR [eax*8+sintab+4]
    fld @@cnt
    fadd @@stp
    fstp @@cnt
    inc eax
    dec ecx
    jne @l1
   

   
ret
initcircle ENDP
circle2 PROC ;SSE version
    mov ecx,[degrees]
    add ecx,ecx
   
    lea edi,[buffer]
    MOVAPS XMM3,[xcentref]
    MOVAPS XMM2,[radius]
@l1:    MOVAPS XMM1,[ecx*8+sintab]
    MULPS XMM1,XMM2
    ADDPS XMM1,XMM3
    MOVAPS [tsse],XMM1
    ;CVTPS2PI MM1,XMM1
    ;MOVQ [tsse],MM1
    ;MOVHLPS XMM0,XMM1
    ;CVTPS2PI MM1,XMM0
    ;MOVQ [tsse+8],MM1
    fld tsse
    fistp tsse
    fld tsse+4
    fistp tsse+4
    fld tsse+8
    fistp tsse+8
    fld tsse+12
    fistp tsse+12
    mov edx,[tsse]
    sal edx,8 ;or whatever it needs to be scaled to
    add edx,[tsse+4]
    mov eax,0f0f0f0f0h
    mov [edi+edx],eax
    mov edx,[tsse+8]
    sal edx,8
    add edx,[tsse+12]
    mov [edi+edx],eax
       
    sub ecx,2
    jne @l1
    ret   
   

circle2 ENDP

here is my first concept of SSE rotation code

    Xt = X*COS(í) - Y*SIN(í)
    Yt = X*SIN(í) + Y*COS(í)
SSE:MOVAPS XMM0,X4
MOVAPS XMM2,Y4
MOVAPS XMM1,XMM0
MOVAPS XMM3,XMM2
MULPS XMM0,COSTAB
MULPS XMM1,SINTAB
MULPS XMM2,SINTAB
MULPS XMM3,COSTAB
SUBPS XMM0,XMM2
ADDPS XMM1,XMM3
;
MOVAPS XMM0,X4
MOVAPS XMM1,Y4

and here is my second initial code designed with dx3d in mind,thats why I choosed XYZWUV vertex format ,use of smaller format like xyzw vertex format and you can exchange all slower movups with movaps
movaps XMM0,XYZW
    movups XMM4,XYZW+24
    movaps XMM5,XYZW+48
    movups XMM6,XYZW+24*3
    movaps XMM7,XYZW+96
    movaps XMM1,xscale1cossin??? ;why not use a free scale of X?
    movaps XMM2,xscale2sincos???
    mulps XMM0,XMM1
    movaps ???ycosminuszsin???,XMM0
    mulps XMM4,XMM1
    movaps ???ycosminuszsin???+16,XMM4
    mulps XMM5,XMM1
    movaps XMM0,XYZW ;load again how many clock cycles after store?
    movaps ???ycosminuszsin???+32,XMM5
    mulps XMM6,XMM1
    movaps ???ycosminuszsin???+48,XMM6
    mulps XMM7,XMM1
    movaps ???ycosminuszsin???+64,XMM7
    mulps XMM2,XMM0
    movaps ???ycosminuszsin???,XMM0
    movaps ???ysinpluszcos???,XMM2
    ;movss/subss/movss and movss/addss/movss section
      ;after
    movss X,XMM1 
    movss XMM3,Ycos ;now if I input a reg for example eax to this proc for
    subss XMM3,zsin ;what change what axis I gonna rotate
    movss Y,XMM3
    movss XMM4,ysin
    addss XMM4,zcos
    movss Z,XMM4
data XYZWUV ;24byte w UV coordinates
data coscossindummy ;unused in rotate can be used to scale variable
data sinsincosdummy
data xcosycosminuszsindummy
data xcosycosminuszsindummy
   
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

rrr314159

#27
Quote from: siekmanskiYou need a lot of shuffles to do 4 or 8 at once.

My case is the opposite: The numbers are being handed to me already packed in a xmm or ymm register. So I'd have to do "a lot of shuffles" to use the LUT. But your routine is so fast, it almost looks worthwhile to unpack the data, apply the LUT one by one, and repack it!

These SIMD instructions always seem to be missing the one instruction u need to make it work (like, integer arithmetic, or vgather). And, they're often disappointingly slow. Nevertheless, they do give a big advantage in straight floating point math performed on data already lined up in memory.

Quote from: dedndavei took a quick look and noticed you are using the FPU for calculations
so, my theory about ignoring a high dword probably isn't the case

That's right: I'm putting rdtsc output into a real8 right away, no "long integer". Better if you're doing a lot of calculations. Plus my routines can be stuck into "live" code very easily - that's why I prefer them. U know, it's worth noting that in another thread jj2007's "bible" routine wasn't working on your old machine ... so, at the moment I suspect it's something to do with that. (I'm hoping to goad u into figuring out the problem, to defend your computer's reputation :)

@Antariy, thanks for your timing numbers, T.S. routines are slow on those older machines, maybe I should just wait a few years until everybody replaces them with i7's like siekmanski's :)

@qword,

I understand you're not endorsing this SSE algo, just putting it out as an example, and as such it's very useful, gives me a baseline to compare to, so thanks! But I'll bet it can be done better!

First, he's got too many coefficients: the x^15 term must be meaningless, given real8 accuracy. Also (I think) he's mult'ing them the wrong way: starts with smallest terms, up to largest. That means the x^13 coeff (at 10^-10) gets added 6 times, mult'ed 7 times! - when u only need to mult it once, and add. That term will be lost in the noise.

[edit] Got the algo working and I was extremely wrong. Those two terms both add 2 digits precision - and cost 10% in time (very close). So the algo as written has about 12 digits precision, and is only 1.2 times faster than FPU; take off last term, get 10 digits, but now it's about 1.32 faster. Very interesting, thanks again qWord

Also his quadrant determination looks bad - u only need 2 comparisons - he's even got 3 branches one after the other, a no-no (according to Agner Fog). I have a trick that gets it down to only one comparison (altho I'm pretty sure the overhead makes it not worthwhile). (I think) there are other problems ... but let me work on it more, I could be wrong about some (or all) of these opinions, haven't actually tested them yet.

@daydreamer2, Yes I've heard the same thing about pixelshader and other GPGPU floating point units, would love to take advantage of it. ... thx for the code!
I am NaN ;)

Siekmanski

Hi, rrr314159

QuoteHowever you do have a couple bugs - which probably never happen the way you use the routine. First, gives wrong answer at exactly pi/2 or 3pi/2. This can be fixed two ways. First, use truncate mode instead of the usual rounding mode (probably that's what you do). With normal rounding mode, it will round up to the 16,385th entry. So you can also put one more entry at the top of the table to catch it - gives exact right answer, 1, and only costs 1 byte. Both these fixes are shown below, FWIW.

You are right, didn't noticed this rounding error.

QuoteSecond, it doesn't handle larger numbers right. At 10,000 the answers are less precise by almost two digits; at 100,000 loses another digit; and fails entirely somewhere between there and 1,000,000. Basic reason (I figure), you convert real number to integer before truncating down to the first quadrant. If you reduce to the first quadrant using FPU, and only then convert to integer, it avoids this problem; but that's a lot slower. I have to handle larger numbers all the time, so do it that way; probably you don't (or would have noticed this problem). Not sure best way to fix it (or even if u need to fix it), so I left it alone.

Never used numbers larger than 2pi, but i'll have a look into this larger number issue.
Creative coders use backward thinking techniques as a strategy.

Siekmanski

Rounding mistake solved:

.data?
align 16
sseSinTable     real4 65536/4 dup (?)

.data
align 4
NegSP           dd 080000000h
QuadrantJump    dd offset Quadrant1,offset Quadrant2,offset Quadrant3,offset Quadrant4

.code

align 4
InitSinTable proc uses edi
LOCAL PI2:real4
LOCAL SinCounter:real4
   
    finit
    fclex

    fld     FLT4(0.5)
    fstp    SinCounter
    fldpi
    fadd    st(0),st(0)
    fstp    PI2
    lea     edi,sseSinTable
    mov     ecx,65536/4
sin_lp:
    fld     SinCounter
    fmul    PI2
    fdiv    FLT4(65536.0)   
    fsin
    fstp    real4 ptr[edi]
    fld     SinCounter
    fadd    FLT4(1.0)
    fstp    SinCounter
    add     edi,4
    dec     ecx
    jnz     sin_lp
    ret
InitSinTable endp

align 4
SSE_Sin proc
    mulss       xmm0,FLT4(10430.2191955273608296137974326)  ; 65535 / PI2
    cvttss2si   eax,xmm0
    and         eax,65535
    mov         edx,eax
    shr         edx,14
    jmp         [QuadrantJump+edx*4]
SSE_Sin endp

align 4
SSE_Cos proc
    mulss       xmm0,FLT4(10430.2191955273608296137974326)  ; 65535 / PI2
    cvttss2si   eax,xmm0
    add         eax,16384
    and         eax,65535
    mov         edx,eax
    shr         edx,14
    jmp         [QuadrantJump+edx*4]
SSE_Cos endp

align 4
Quadrant1:
    movss       xmm0,real4 ptr [sseSinTable+eax*4]
    ret
align 4
Quadrant2:
    mov         edx,16384*2
    sub         edx,eax
    movss       xmm0,real4 ptr [sseSinTable+edx*4]
    ret
align 4
Quadrant3:
    movss       xmm0,real4 ptr [sseSinTable-(16384*2*4)+eax*4]
    xorps       xmm0,real4 ptr NegSP
    ret
align 4
Quadrant4:
    mov         edx,16384*4
    sub         edx,eax
    movss       xmm0,real4 ptr [sseSinTable+edx*4]
    xorps       xmm0,real4 ptr NegSP
    ret


Next, attack larger number issue.
Creative coders use backward thinking techniques as a strategy.