Trigonometry ...

Antariy · March 27, 2015, 05:33:33 AM

Every machine has QueryPerformanceCounter, but the problem is some OSes return the processor frequency with the call to the QueryPerformanceFrequency, and some return not it but some other value, with smaller than TSC resolution.

dedndave · March 27, 2015, 05:39:33 AM

if he "beats" QueryPerformanceCounter against QueryPerformanceFrequency, it should work correctly either way

Antariy · March 27, 2015, 05:57:08 AM

Quote from: dedndave on March 27, 2015, 05:39:33 AM
if he "beats" QueryPerformanceCounter against QueryPerformanceFrequency, it should work correctly either way

That's right, just noted that because once has a faulty cycles counting measure due to Win7 provides not cpu freq but rather some fixed freq.

Siekmanski · March 27, 2015, 06:13:35 AM

Quotebut - it only looks up one value at a time, right?

Yes, I never needed to calculate more than one at the time.
You need a lot of shuffles to do 4 or 8 at once.
And if so, I precalculate those.

Here's a new one with the useless "sub" removed in part3:
And routines using a Jump table. Don't now if it's faster, haven't tested it yet ( maybe tomorrow )

Edit: new version see Reply #42

qWord · March 27, 2015, 06:43:19 AM

rrr314159,

the precision significantly decrease for negative values. Also, you did know that FISTTP is an SSE3 instruction?

Just for compare, below an SSE2 implementation for sin(x) (also TP). Due to the needed truncation the domain is approx. -3.37e9 to +3.37e9. The relative error is below 10^-11.

Code Select

    OPTION PROLOGUE:none
    OPTION EPILOGUE:none
    align 16
    sinsd proc
        
        .const
            align 16
            ; head: pi/2  ( hx = Round(mantissa(pi/2)*2^23 )/2^23 )
            sinsd_cc0       REAL8 1.57080078125
            ; tail: pi/2  ( tx = pi /2 - hx )
            sinsd_cc1       REAl8 -4.45445510338076867830836E-6
            
            ; 2/pi
            sinsd_2_pi  REAL8  0.6366197723675813430755351  ; 2/pi
    
            sinsd_qdrnt REAl8  0.0
                        REAl8  1.5707963267948966192313217
                        REAl8  0.0
                        REAl8 -1.5707963267948966192313217
            
            sinsd_c     REAL8 -7.647163731819816475901132E-13
                        REAL8 1.6059043836821614599392377E-10
                        REAL8 -2.50521083854417187750521E-8
                        REAL8 2.75573192239858906525573E-6
                        REAL8 -0.19841269841269841269841E-3
                        REAL8 8.333333333333333333333333E-3
                        REAL8 -0.16666666666666666666667
                        REAL8 1.0
        .code
        
        
        ; k = Floor(x*2/Pi)
        ; xr = (x-C0*k) - C1*k
        ; REF: Jean-Michel Muller, "Elementary Functions, Algorithms and Implementation", Second edition
        movsd xmm3,xmm0
        mulsd xmm3,sinsd_2_pi
        cvttsd2si eax,xmm3
        cvtsi2sd xmm1,eax
        mov edx,eax
        and edx,3
        movsd xmm2,xmm1
        mulsd xmm1,sinsd_cc0
        mulsd xmm2,sinsd_cc1
        subsd xmm0,xmm1
        subsd xmm0,xmm2
        
        jz @calc
        movsd xmm1,xmm0
        movsd xmm0,sinsd_qdrnt[edx*REAL8]
    
        and eax,80000003h
        js @s
        jnp @np
        jp @p
    @s: test eax,2
        jz @p
    @np:    subsd xmm0,xmm1
        jmp @calc
    @p:     addsd xmm0,xmm1
    @calc:
    
        movsd xmm2,xmm0
        mulsd xmm2,xmm2
        movsd xmm3,sinsd_c[0*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[1*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[2*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[3*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[4*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[5*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[6*REAl8]
        mulsd xmm3,xmm2
        addsd xmm3,sinsd_c[7*REAl8]
        mulsd xmm0,xmm3
    
        ret
    
    sinsd endp

regards

daydreamer · March 27, 2015, 07:03:01 AM

I have old code that creates+rotates etc+x/z,y/z cheating with help of SSE reciprocal instructions, two Points at a time, with help of a sincossincos 256 entry LUT
if anone is interested I dig up code

Siekmanski · March 27, 2015, 08:14:12 AM

Yeah, i am !!!

Siekmanski · March 27, 2015, 12:09:51 PM

Here are all the routines and the timing results.
I have included a jump table routine.

for the sources see Reply #42

MichaelW · March 27, 2015, 02:34:26 PM

You can partially compensate for the lower frequency of the performance counters, relative to the TSC, by synchronizing with the counter at the start of the timed period, effectively reducing the uncertainty from two counter cycles to one. Given the variable clock speed for recent processors, and assuming that there is no reliable method of forcing the processor to run at a fixed frequency, for real-time timings the performance counters are all we have to work with.

rrr314159 · March 27, 2015, 09:17:17 PM

Hi all,

@qword, thanks for catching that, it's not supposed to be fisttp but fistp. Works the same for positive numbers but for negative it needs fistp with truncate (down to -inf) rounding mode, then the precision remains unchanged. I'll post a fix soon. I'll also take a look at your SSE algo ... the precision at 10^-11 is much better than I need, but more never hurts.

@dedndave, you're suggesting (I think) that you and nidud had considerably longer timings, so somewhere a multiplication that worked OK on my machine is overflowing into the high word on yours - which I'm neglecting to check. Makes sense, if so be easy to fix. I'll take a look ...

@MichaelW, rdtsc works fine on my machines, even with variable clock - I get the exact same numbers as with perf counter (but with 3 extra digits of course). Perhaps I need to check it for longer times, I rarely go above 10 seconds or so - or just look more carefully ;) It's often important to get nanosecond timing, so I can't abandon rdtsc. Therefore I've made two versions of timing routines, using each. For this project, as it turned out I could even use your timers.asm; as posted, I'm only timing one big loop. When I started however I was timing many short segments, doing lots of nanosecond calculations, and not looping; so needed my routines, and just left them in final version. If I get around to it I'll replace them.

@siekmanski, here's what I wrote b4 reading recent posts,

Your LUT technique is better than what I was doing in a couple of ways, checking for quadrants is simpler and faster, so I'll borrow that - thanks! [edit] - I see you've posted an "even better" jump table version, I'll take a look ...

When fed numbers in the right range it's about twice as fast as the Taylor Series routine at about the same precision. But that doesn't change the conclusion that, without vgather instruction, the T.S. should be better for SIMD. It should be able to do 4 numbers at once (or, 8 with AVX) and therefore still be twice (or more) as fast as the LUT - I hope. I'll let u know when finished. [edit] qword's posting may be just what I need ...

However you do have a couple bugs - which probably never happen the way you use the routine. First, gives wrong answer at exactly pi/2 or 3pi/2. This can be fixed two ways. First, use truncate mode instead of the usual rounding mode (probably that's what you do). With normal rounding mode, it will round up to the 16,385th entry. So you can also put one more entry at the top of the table to catch it - gives exact right answer, 1, and only costs 1 byte. Both these fixes are shown below, FWIW.

Second, it doesn't handle larger numbers right. At 10,000 the answers are less precise by almost two digits; at 100,000 loses another digit; and fails entirely somewhere between there and 1,000,000. Basic reason (I figure), you convert real number to integer before truncating down to the first quadrant. If you reduce to the first quadrant using FPU, and only then convert to integer, it avoids this problem; but that's a lot slower. I have to handle larger numbers all the time, so do it that way; probably you don't (or would have noticed this problem). Not sure best way to fix it (or even if u need to fix it), so I left it alone.

Here's the slightly modified LUT as I tested it:

Code Select

; SSE Sin Cos routine by Siekmanski 2010.
include \masm32\include\masm32rt.inc
     .686p
     .xmm 

; minor mods rrr314159 2015/03/25

FLT4 MACRO float_number:REQ
LOCAL float_num
    .data
     align 4
     float_num real4 float_number
    .code
    EXITM <float_num>
ENDM

.data?
align 16
sseSinTable real4 1 + 65536/4 dup (?)          ; rrr, add one more entry at top

temp4 real4 ?                                  ; rrr, for going from FPU to xmm

.data
align 16
NegSP			dd	080000000h,000000000h,000000000h,000000000h

.code

align 4
InitSinTable proc public uses edi
LOCAL PI2:real4
LOCAL SinCounter:real4
	
    finit
    fclex
    
;    call initFPUsettruncatemode                ; rrr, set truncate mode or else add one more entry at top
                                                ; I added the entry because it's a little more accurate
	fld		FLT4(0.0)
	fstp	SinCounter
	fldpi
	fadd	st(0),st(0)
	fstp	PI2
	lea		edi,sseSinTable
	mov		ecx,65536/4
sin_lp:
	fld		SinCounter
	fdiv	FLT4(65536.0)
	fmul	PI2
	fsin
	fstp	real4 ptr[edi]
	fld		SinCounter
	fadd	FLT4(1.0)
	fstp	SinCounter
	add		edi,4
	dec		ecx
;	jnz		sin_lp
	jge		sin_lp                    ; rrr, jge instead of jnz to add entry at top
	ret

InitSinTable endp


align 16
SSE_Sin proc ; in: xmm0, out: xmm0 with sin value 

; modified rrr314159 2015/03/25 to use st(0) in and out of xmm0

fstp temp4
movss xmm0, temp4

	mulss		xmm0,FLT4(10430.3783504704527249495663163) ; 65536 / PI2
	cvttss2si	eax,xmm0

	and			eax,65535
	cmp			eax,16384-1
	ja			Part2
	movss		xmm0,real4 ptr [sseSinTable+eax*4]
	jmp rrrret
Part2:
	cmp			eax,16384*2-1
	ja			Part3
	mov			edx,16384*2
	sub			edx,eax
	movss		xmm0,real4 ptr [sseSinTable+edx*4]
	jmp rrrret
Part3:
	cmp			eax,16384*3-1
	ja			Part4		
	sub			eax,16384*2
	movss		xmm0,real4 ptr [sseSinTable+eax*4]
	xorps		xmm0,oword ptr NegSP
	jmp rrrret
Part4:
	mov			edx,16384*4
	sub			edx,eax	
	movss		xmm0,real4 ptr [sseSinTable+edx*4]
	xorps		xmm0,oword ptr NegSP

rrrret:
    movss temp4, xmm0
    fld temp4
    ret

 SSE_Sin endp	


align 16
SSE_Cos proc	; in: xmm0, out: xmm0 with cos value 

; modified rrr314159 2015/03/25 to use st(0) in and out of xmm0

fstp temp4
movss xmm0, temp4

	mulss		xmm0,FLT4(10430.3783504704527249495663163) ; 65536 / PI2
	cvttss2si	eax,xmm0
	add			eax,16384
	and			eax,65535
	cmp			eax,16384-1
	ja			Part2
	movss		xmm0,real4 ptr [sseSinTable+eax*4]
	jmp rrrret
Part2:
	cmp			eax,16384*2-1
	ja			Part3
	mov			edx,16384*2
	sub			edx,eax
	movss		xmm0,real4 ptr [sseSinTable+edx*4]
	jmp rrrret
Part3:
	cmp			eax,16384*3-1
	ja			Part4		
	sub			eax,16384*2
	movss		xmm0,real4 ptr [sseSinTable+eax*4]
	xorps		xmm0,oword ptr NegSP
	jmp rrrret
Part4:
	mov			edx,16384*4
	sub			edx,eax	
	movss		xmm0,real4 ptr [sseSinTable+edx*4]
	xorps		xmm0,oword ptr NegSP

rrrret:
    movss temp4, xmm0
    fld temp4
    ret

 SSE_Cos endp	

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data?
    thecontrolword dw ?
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
initFPUsettruncatemode:
    finit
    fstcw thecontrolword
    fwait
    mov   ax,thecontrolword
    or    ax,0400h  ;this will set both bits of the RC field to the truncating mode
                    ; rounding down to -infinity as math INT function

    mov thecontrolword, ax
    fldcw thecontrolword     ;load the modified Control Word
ret

end

Here are a few test runs to illustrate the LUT's 2X higher speed, relative to T.S., and also the large-number problem:

Code Select

WITH NUMBERS NEAR 0.0

    ----------------------------------------
FPU fsin nanos per iter         31.842
FPU fcos nanos per iter         31.906

Test Function: SIEKMANSKI========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.846
Speed ratio FPU/test fn         17.3

Precision
average precision               3.09e-005
worst precision                 9.25e-005		; GOOD PRECISION

********** COS Using faster version **********
Nanoseconds per Iteration       1.822
Speed ratio FPU/test fn         17.5

Precision
average precision               3.09e-005
worst precision                 9.34e-005

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.562
Speed ratio FPU/test fn         8.95

Precision
average precision               9.98e-005
worst precision                 8.27e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.317
Speed ratio FPU/test fn         9.61

Precision
average precision               1.00e-004
worst precision                 8.94e-004

**************************************************************
WITH NUMBERS NEAR 10000.0

    ----------------------------------------
FPU fsin nanos per iter         31.873
FPU fcos nanos per iter         31.955

Test Function: SIEKMANSKI========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.808
Speed ratio FPU/test fn         17.7

Precision
average precision               5.46e-004
worst precision                 1.42e-003

********** COS Using faster version **********
Nanoseconds per Iteration       1.810
Speed ratio FPU/test fn         17.6

Precision
average precision               5.45e-004
worst precision                 1.36e-003		; LOSING PRECISION

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.548
Speed ratio FPU/test fn         9

Precision
average precision               1.00e-004
worst precision                 8.92e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.555
Speed ratio FPU/test fn         8.98

Precision
average precision               9.98e-005
worst precision                 8.28e-004

**************************************************************
WITH NUMBERS NEAR 100000.0

    ----------------------------------------
FPU fsin nanos per iter         31.800
FPU fcos nanos per iter         31.895

Test Function: SIEKMANSKI========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       1.806
Speed ratio FPU/test fn         17.6

Precision
average precision               5.33e-003
worst precision                 1.42e-002		; LOSING PRECISION

********** COS Using faster version **********
Nanoseconds per Iteration       1.739
Speed ratio FPU/test fn         18.3

Precision
average precision               5.16e-003
worst precision                 1.41e-002

Test Function: trigC ========================================

********** SIN Using faster version **********
Nanoseconds per Iteration       3.518
Speed ratio FPU/test fn         9.05

Precision
average precision               9.99e-005
worst precision                 8.74e-004

********** COS Using faster version **********
Nanoseconds per Iteration       3.536
Speed ratio FPU/test fn         9.01

Precision
average precision               9.98e-005
worst precision                 8.46e-004

dedndave · March 28, 2015, 12:21:32 AM

Quote from: rrr314159 on March 27, 2015, 09:17:17 PM
@dedndave, you're suggesting (I think) that you and nidud had considerably longer timings...

no - our older processors seem to be faster than all the new ones

i took a quick look and noticed you are using the FPU for calculations
so, my theory about ignoring a high dword probably isn't the case

daydreamer · March 28, 2015, 04:59:22 AM

I read that pixelshader is capable of have 1 cycle lookup of sin and cos,anyone capable of writing pixelshader that outputs LUT to system memory?
its earlier ddraw that doesnt work on newer OS,win xp tops ,actually should be wieved in 8bit color palette, sorry cant find the latest source for it
this is very early version I found

Code Select

initcircle PROC
LOCAL @@stp :REAL4
LOCAL @@cnt :REAL4
    finit
    fldpi
    fidiv degrees
    fstp @@stp
    fldz
    fstp @@cnt
    mov eax,0
    mov ecx,[degrees]
    add ecx,ecx
    add ecx,ecx
@l1:fld @@cnt
    fsincos
    fstp REAL4 PTR [eax*8+sintab]
    fstp REAL4 PTR [eax*8+sintab+4]
    fld @@cnt
    fadd @@stp
    fstp @@cnt
    inc eax
    dec ecx
    jne @l1
    

    
ret
initcircle ENDP
circle2 PROC ;SSE version
    mov ecx,[degrees]
    add ecx,ecx
    
    lea edi,[buffer]
    MOVAPS XMM3,[xcentref]
    MOVAPS XMM2,[radius]
@l1:    MOVAPS XMM1,[ecx*8+sintab]
    MULPS XMM1,XMM2
    ADDPS XMM1,XMM3
    MOVAPS [tsse],XMM1
    ;CVTPS2PI MM1,XMM1
    ;MOVQ [tsse],MM1
    ;MOVHLPS XMM0,XMM1
    ;CVTPS2PI MM1,XMM0
    ;MOVQ [tsse+8],MM1
    fld tsse
    fistp tsse
    fld tsse+4
    fistp tsse+4
    fld tsse+8
    fistp tsse+8
    fld tsse+12
    fistp tsse+12
    mov edx,[tsse]
    sal edx,8 ;or whatever it needs to be scaled to
    add edx,[tsse+4]
    mov eax,0f0f0f0f0h
    mov [edi+edx],eax
    mov edx,[tsse+8]
    sal edx,8
    add edx,[tsse+12]
    mov [edi+edx],eax
        
    sub ecx,2
    jne @l1
    ret    
    

circle2 ENDP

here is my first concept of SSE rotation code

Code Select


    Xt = X*COS(í) - Y*SIN(í)
    Yt = X*SIN(í) + Y*COS(í)
SSE:MOVAPS XMM0,X4
	MOVAPS XMM2,Y4
	MOVAPS XMM1,XMM0
	MOVAPS XMM3,XMM2
	MULPS XMM0,COSTAB
	MULPS XMM1,SINTAB
	MULPS XMM2,SINTAB
	MULPS XMM3,COSTAB
	SUBPS XMM0,XMM2
	ADDPS XMM1,XMM3
	;
	MOVAPS XMM0,X4
	MOVAPS XMM1,Y4

and here is my second initial code designed with dx3d in mind,thats why I choosed XYZWUV vertex format ,use of smaller format like xyzw vertex format and you can exchange all slower movups with movaps

Code Select

movaps XMM0,XYZW
    movups XMM4,XYZW+24
    movaps XMM5,XYZW+48
    movups XMM6,XYZW+24*3
    movaps XMM7,XYZW+96
    movaps XMM1,xscale1cossin??? ;why not use a free scale of X?
    movaps XMM2,xscale2sincos???
    mulps XMM0,XMM1
    movaps ???ycosminuszsin???,XMM0
    mulps XMM4,XMM1
    movaps ???ycosminuszsin???+16,XMM4
    mulps XMM5,XMM1
    movaps XMM0,XYZW ;load again how many clock cycles after store?
    movaps ???ycosminuszsin???+32,XMM5
    mulps XMM6,XMM1
    movaps ???ycosminuszsin???+48,XMM6
    mulps XMM7,XMM1
    movaps ???ycosminuszsin???+64,XMM7
    mulps XMM2,XMM0
    movaps ???ycosminuszsin???,XMM0
    movaps ???ysinpluszcos???,XMM2
    ;movss/subss/movss and movss/addss/movss section
    		  ;after
    movss X,XMM1  
    movss XMM3,Ycos ;now if I input a reg for example eax to this proc for 
    subss XMM3,zsin ;what change what axis I gonna rotate
    movss Y,XMM3
    movss XMM4,ysin
    addss XMM4,zcos
    movss Z,XMM4
	data XYZWUV ;24byte w UV coordinates
	data coscossindummy ;unused in rotate can be used to scale variable
	data sinsincosdummy
	data xcosycosminuszsindummy 
	data xcosycosminuszsindummy

rrr314159 · March 28, 2015, 05:30:20 AM

Quote from: siekmanskiYou need a lot of shuffles to do 4 or 8 at once.

My case is the opposite: The numbers are being handed to me already packed in a xmm or ymm register. So I'd have to do "a lot of shuffles" to use the LUT. But your routine is so fast, it almost looks worthwhile to unpack the data, apply the LUT one by one, and repack it!

These SIMD instructions always seem to be missing the one instruction u need to make it work (like, integer arithmetic, or vgather). And, they're often disappointingly slow. Nevertheless, they do give a big advantage in straight floating point math performed on data already lined up in memory.

Quote from: dedndavei took a quick look and noticed you are using the FPU for calculations
so, my theory about ignoring a high dword probably isn't the case

That's right: I'm putting rdtsc output into a real8 right away, no "long integer". Better if you're doing a lot of calculations. Plus my routines can be stuck into "live" code very easily - that's why I prefer them. U know, it's worth noting that in another thread jj2007's "bible" routine wasn't working on your old machine ... so, at the moment I suspect it's something to do with that. (I'm hoping to goad u into figuring out the problem, to defend your computer's reputation :)

@Antariy, thanks for your timing numbers, T.S. routines are slow on those older machines, maybe I should just wait a few years until everybody replaces them with i7's like siekmanski's :)

@qword,

I understand you're not endorsing this SSE algo, just putting it out as an example, and as such it's very useful, gives me a baseline to compare to, so thanks! But I'll bet it can be done better!

First, he's got too many coefficients: the x^15 term must be meaningless, given real8 accuracy. Also (I think) he's mult'ing them the wrong way: starts with smallest terms, up to largest. That means the x^13 coeff (at 10^-10) gets added 6 times, mult'ed 7 times! - when u only need to mult it once, and add. That term will be lost in the noise.

[edit] Got the algo working and I was extremely wrong. Those two terms both add 2 digits precision - and cost 10% in time (very close). So the algo as written has about 12 digits precision, and is only 1.2 times faster than FPU; take off last term, get 10 digits, but now it's about 1.32 faster. Very interesting, thanks again qWord

Also his quadrant determination looks bad - u only need 2 comparisons - he's even got 3 branches one after the other, a no-no (according to Agner Fog). I have a trick that gets it down to only one comparison (altho I'm pretty sure the overhead makes it not worthwhile). (I think) there are other problems ... but let me work on it more, I could be wrong about some (or all) of these opinions, haven't actually tested them yet.

@daydreamer2, Yes I've heard the same thing about pixelshader and other GPGPU floating point units, would love to take advantage of it. ... thx for the code!

Siekmanski · March 28, 2015, 08:01:53 PM

Hi, rrr314159

QuoteHowever you do have a couple bugs - which probably never happen the way you use the routine. First, gives wrong answer at exactly pi/2 or 3pi/2. This can be fixed two ways. First, use truncate mode instead of the usual rounding mode (probably that's what you do). With normal rounding mode, it will round up to the 16,385th entry. So you can also put one more entry at the top of the table to catch it - gives exact right answer, 1, and only costs 1 byte. Both these fixes are shown below, FWIW.

You are right, didn't noticed this rounding error.

QuoteSecond, it doesn't handle larger numbers right. At 10,000 the answers are less precise by almost two digits; at 100,000 loses another digit; and fails entirely somewhere between there and 1,000,000. Basic reason (I figure), you convert real number to integer before truncating down to the first quadrant. If you reduce to the first quadrant using FPU, and only then convert to integer, it avoids this problem; but that's a lot slower. I have to handle larger numbers all the time, so do it that way; probably you don't (or would have noticed this problem). Not sure best way to fix it (or even if u need to fix it), so I left it alone.

Never used numbers larger than 2pi, but i'll have a look into this larger number issue.

Siekmanski · March 29, 2015, 01:12:43 AM

Rounding mistake solved:

Code Select

.data?
align 16
sseSinTable     real4 65536/4 dup (?)

.data
align 4
NegSP           dd 080000000h
QuadrantJump    dd offset Quadrant1,offset Quadrant2,offset Quadrant3,offset Quadrant4

.code

align 4
InitSinTable proc uses edi
LOCAL PI2:real4
LOCAL SinCounter:real4
    
    finit
    fclex

    fld     FLT4(0.5)
    fstp    SinCounter
    fldpi
    fadd    st(0),st(0)
    fstp    PI2
    lea     edi,sseSinTable
    mov     ecx,65536/4
sin_lp:
    fld     SinCounter
    fmul    PI2
    fdiv    FLT4(65536.0)    
    fsin
    fstp    real4 ptr[edi]
    fld     SinCounter
    fadd    FLT4(1.0)
    fstp    SinCounter
    add     edi,4
    dec     ecx
    jnz     sin_lp
    ret
InitSinTable endp

align 4
SSE_Sin proc
    mulss       xmm0,FLT4(10430.2191955273608296137974326)  ; 65535 / PI2 
    cvttss2si   eax,xmm0
    and         eax,65535
    mov         edx,eax
    shr         edx,14
    jmp         [QuadrantJump+edx*4]
SSE_Sin endp

align 4
SSE_Cos proc
    mulss       xmm0,FLT4(10430.2191955273608296137974326)  ; 65535 / PI2 
    cvttss2si   eax,xmm0
    add         eax,16384
    and         eax,65535
    mov         edx,eax
    shr         edx,14
    jmp         [QuadrantJump+edx*4]
SSE_Cos endp

align 4
Quadrant1:
    movss       xmm0,real4 ptr [sseSinTable+eax*4]
    ret
align 4
Quadrant2:
    mov         edx,16384*2
    sub         edx,eax
    movss       xmm0,real4 ptr [sseSinTable+edx*4]
    ret
align 4
Quadrant3:
    movss       xmm0,real4 ptr [sseSinTable-(16384*2*4)+eax*4]
    xorps       xmm0,real4 ptr NegSP
    ret
align 4
Quadrant4:
    mov         edx,16384*4
    sub         edx,eax 
    movss       xmm0,real4 ptr [sseSinTable+edx*4]
    xorps       xmm0,real4 ptr NegSP
    ret

Next, attack larger number issue.

The MASM Forum

News:

Trigonometry ...

Antariy

dedndave

Antariy

Siekmanski

qWord

daydreamer

Siekmanski

Siekmanski

MichaelW

rrr314159

dedndave

daydreamer

rrr314159

Siekmanski

Siekmanski