Hi all,
@qword, thanks for catching that, it's not supposed to be fisttp but fistp. Works the same for positive numbers but for negative it needs fistp with truncate (down to -inf) rounding mode, then the precision remains unchanged. I'll post a fix soon. I'll also take a look at your SSE algo ... the precision at 10^-11 is much better than I need, but more never hurts.
@dedndave, you're suggesting (I think) that you and nidud had considerably longer timings, so somewhere a multiplication that worked OK on my machine is overflowing into the high word on yours - which I'm neglecting to check. Makes sense, if so be easy to fix. I'll take a look ...
@MichaelW, rdtsc works fine on my machines, even with variable clock - I get the exact same numbers as with perf counter (but with 3 extra digits of course). Perhaps I need to check it for longer times, I rarely go above 10 seconds or so - or just look more carefully ;) It's often important to get nanosecond timing, so I can't abandon rdtsc. Therefore I've made two versions of timing routines, using each. For this project, as it turned out I could even use your timers.asm; as posted, I'm only timing one big loop. When I started however I was timing many short segments, doing lots of nanosecond calculations, and not looping; so needed my routines, and just left them in final version. If I get around to it I'll replace them.
@siekmanski, here's what I wrote b4 reading recent posts,
Your LUT technique is better than what I was doing in a couple of ways, checking for quadrants is simpler and faster, so I'll borrow that - thanks! [edit] - I see you've posted an "even better" jump table version, I'll take a look ...
When fed numbers in the right range it's about twice as fast as the Taylor Series routine at about the same precision. But that doesn't change the conclusion that, without vgather instruction, the T.S. should be better for SIMD. It should be able to do 4 numbers at once (or, 8 with AVX) and therefore still be twice (or more) as fast as the LUT - I hope. I'll let u know when finished. [edit] qword's posting may be just what I need ...
However you do have a couple bugs - which probably never happen the way you use the routine. First, gives wrong answer at exactly pi/2 or 3pi/2. This can be fixed two ways. First, use truncate mode instead of the usual rounding mode (probably that's what you do). With normal rounding mode, it will round up to the 16,385th entry. So you can also put one more entry at the top of the table to catch it - gives exact right answer, 1, and only costs 1 byte. Both these fixes are shown below, FWIW.
Second, it doesn't handle larger numbers right. At 10,000 the answers are less precise by almost two digits; at 100,000 loses another digit; and fails entirely somewhere between there and 1,000,000. Basic reason (I figure), you convert real number to integer before truncating down to the first quadrant. If you reduce to the first quadrant using FPU, and only then convert to integer, it avoids this problem; but that's a lot slower. I have to handle larger numbers all the time, so do it that way; probably you don't (or would have noticed this problem). Not sure best way to fix it (or even if u need to fix it), so I left it alone.
Here's the slightly modified LUT as I tested it:
; SSE Sin Cos routine by Siekmanski 2010.
include \masm32\include\masm32rt.inc
.686p
.xmm
; minor mods rrr314159 2015/03/25
FLT4 MACRO float_number:REQ
LOCAL float_num
.data
align 4
float_num real4 float_number
.code
EXITM <float_num>
ENDM
.data?
align 16
sseSinTable real4 1 + 65536/4 dup (?) ; rrr, add one more entry at top
temp4 real4 ? ; rrr, for going from FPU to xmm
.data
align 16
NegSP dd 080000000h,000000000h,000000000h,000000000h
.code
align 4
InitSinTable proc public uses edi
LOCAL PI2:real4
LOCAL SinCounter:real4
finit
fclex
; call initFPUsettruncatemode ; rrr, set truncate mode or else add one more entry at top
; I added the entry because it's a little more accurate
fld FLT4(0.0)
fstp SinCounter
fldpi
fadd st(0),st(0)
fstp PI2
lea edi,sseSinTable
mov ecx,65536/4
sin_lp:
fld SinCounter
fdiv FLT4(65536.0)
fmul PI2
fsin
fstp real4 ptr[edi]
fld SinCounter
fadd FLT4(1.0)
fstp SinCounter
add edi,4
dec ecx
; jnz sin_lp
jge sin_lp ; rrr, jge instead of jnz to add entry at top
ret
InitSinTable endp
align 16
SSE_Sin proc ; in: xmm0, out: xmm0 with sin value
; modified rrr314159 2015/03/25 to use st(0) in and out of xmm0
fstp temp4
movss xmm0, temp4
mulss xmm0,FLT4(10430.3783504704527249495663163) ; 65536 / PI2
cvttss2si eax,xmm0
and eax,65535
cmp eax,16384-1
ja Part2
movss xmm0,real4 ptr [sseSinTable+eax*4]
jmp rrrret
Part2:
cmp eax,16384*2-1
ja Part3
mov edx,16384*2
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
jmp rrrret
Part3:
cmp eax,16384*3-1
ja Part4
sub eax,16384*2
movss xmm0,real4 ptr [sseSinTable+eax*4]
xorps xmm0,oword ptr NegSP
jmp rrrret
Part4:
mov edx,16384*4
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
xorps xmm0,oword ptr NegSP
rrrret:
movss temp4, xmm0
fld temp4
ret
SSE_Sin endp
align 16
SSE_Cos proc ; in: xmm0, out: xmm0 with cos value
; modified rrr314159 2015/03/25 to use st(0) in and out of xmm0
fstp temp4
movss xmm0, temp4
mulss xmm0,FLT4(10430.3783504704527249495663163) ; 65536 / PI2
cvttss2si eax,xmm0
add eax,16384
and eax,65535
cmp eax,16384-1
ja Part2
movss xmm0,real4 ptr [sseSinTable+eax*4]
jmp rrrret
Part2:
cmp eax,16384*2-1
ja Part3
mov edx,16384*2
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
jmp rrrret
Part3:
cmp eax,16384*3-1
ja Part4
sub eax,16384*2
movss xmm0,real4 ptr [sseSinTable+eax*4]
xorps xmm0,oword ptr NegSP
jmp rrrret
Part4:
mov edx,16384*4
sub edx,eax
movss xmm0,real4 ptr [sseSinTable+edx*4]
xorps xmm0,oword ptr NegSP
rrrret:
movss temp4, xmm0
fld temp4
ret
SSE_Cos endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data?
thecontrolword dw ?
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
initFPUsettruncatemode:
finit
fstcw thecontrolword
fwait
mov ax,thecontrolword
or ax,0400h ;this will set both bits of the RC field to the truncating mode
; rounding down to -infinity as math INT function
mov thecontrolword, ax
fldcw thecontrolword ;load the modified Control Word
ret
end
Here are a few test runs to illustrate the LUT's 2X higher speed, relative to T.S., and also the large-number problem:
WITH NUMBERS NEAR 0.0
----------------------------------------
FPU fsin nanos per iter 31.842
FPU fcos nanos per iter 31.906
Test Function: SIEKMANSKI========================================
********** SIN Using faster version **********
Nanoseconds per Iteration 1.846
Speed ratio FPU/test fn 17.3
Precision
average precision 3.09e-005
worst precision 9.25e-005 ; GOOD PRECISION
********** COS Using faster version **********
Nanoseconds per Iteration 1.822
Speed ratio FPU/test fn 17.5
Precision
average precision 3.09e-005
worst precision 9.34e-005
Test Function: trigC ========================================
********** SIN Using faster version **********
Nanoseconds per Iteration 3.562
Speed ratio FPU/test fn 8.95
Precision
average precision 9.98e-005
worst precision 8.27e-004
********** COS Using faster version **********
Nanoseconds per Iteration 3.317
Speed ratio FPU/test fn 9.61
Precision
average precision 1.00e-004
worst precision 8.94e-004
**************************************************************
WITH NUMBERS NEAR 10000.0
----------------------------------------
FPU fsin nanos per iter 31.873
FPU fcos nanos per iter 31.955
Test Function: SIEKMANSKI========================================
********** SIN Using faster version **********
Nanoseconds per Iteration 1.808
Speed ratio FPU/test fn 17.7
Precision
average precision 5.46e-004
worst precision 1.42e-003
********** COS Using faster version **********
Nanoseconds per Iteration 1.810
Speed ratio FPU/test fn 17.6
Precision
average precision 5.45e-004
worst precision 1.36e-003 ; LOSING PRECISION
Test Function: trigC ========================================
********** SIN Using faster version **********
Nanoseconds per Iteration 3.548
Speed ratio FPU/test fn 9
Precision
average precision 1.00e-004
worst precision 8.92e-004
********** COS Using faster version **********
Nanoseconds per Iteration 3.555
Speed ratio FPU/test fn 8.98
Precision
average precision 9.98e-005
worst precision 8.28e-004
**************************************************************
WITH NUMBERS NEAR 100000.0
----------------------------------------
FPU fsin nanos per iter 31.800
FPU fcos nanos per iter 31.895
Test Function: SIEKMANSKI========================================
********** SIN Using faster version **********
Nanoseconds per Iteration 1.806
Speed ratio FPU/test fn 17.6
Precision
average precision 5.33e-003
worst precision 1.42e-002 ; LOSING PRECISION
********** COS Using faster version **********
Nanoseconds per Iteration 1.739
Speed ratio FPU/test fn 18.3
Precision
average precision 5.16e-003
worst precision 1.41e-002
Test Function: trigC ========================================
********** SIN Using faster version **********
Nanoseconds per Iteration 3.518
Speed ratio FPU/test fn 9.05
Precision
average precision 9.99e-005
worst precision 8.74e-004
********** COS Using faster version **********
Nanoseconds per Iteration 3.536
Speed ratio FPU/test fn 9.01
Precision
average precision 9.98e-005
worst precision 8.46e-004