Trigonometry ...

Siekmanski · April 03, 2015, 10:00:24 AM

Quote from: rrr314159 on April 03, 2015, 03:02:28 AM
... It's the problem I've mentioned often, all my routines use "truncate down to -inf" rounding mode, but everybody else uses the default rounding mode, and that's what you're doing here. So the Taylor Series should also give wrong results for negative numbers, some other borderline cases too, where it rounds in the wrong direction.

Didn't get it at first, but now I do. --> ("truncate down to -inf" rounding mode) I'm just a simple Dutch guy, you know.

Solved "wrong results for negative numbers" for the default rounding mode in my previous post.

Siekmanski · April 03, 2015, 10:59:49 AM

Did a test with this fpu routine, it's slightly faster than the SSE2 routine on my PC.

Code Select

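FPUsin_Lut proc                                 ; in: st(0) = angle x in radians, out: st(0) = table value for sin(x)
    fld     st(0)                               ; duplicate x
    fmul    FLT8(0.15915494309189533576888376337)   ; x * 1/(2*pi) = number of full turns
    fistp   real8 ptr [esp-8]                   ; n = turns rounded to an integer (current FPU rounding mode), stored as 64 bits
    mov     eax,dword ptr [esp-8]               ; low dword of n (eax is reloaded below before it is used)
    fild    real8 ptr [esp-8]                   ; reload n
    fmul    FLT8(6.28318530717958647692528676656)   ; n * 2*pi
    fsub                                        ; x - n*2*pi (the no-operand form pops the stack)
    fmul    FLT8(10430.2191955273608296137974326)   ; scale by 65535/(2*pi) to table-index units
    fistp   dword ptr [esp-4]                   ; round to an integer index
    mov     eax,dword ptr [esp-4]
    cmp     eax,-1
    jg      greater                             ; index >= 0: use it as is
    sub     eax,1                               ; negative index: step down one more before wrapping
greater:
    and     eax,65535                           ; wrap to 0..65535
    mov     edx,eax
    shr     edx,14                              ; edx = quadrant 0..3 (16384 indices each)
    jnz     Q2
    fld     real4 ptr [SSE2SinTableSP+eax*4]    ; quadrant 1: table[i]
    ret
Q2: cmp     edx,2
    je      Q3
    ja      Q4
    mov     edx,16384*2
    sub     edx,eax
    fld     real4 ptr [SSE2SinTableSP+edx*4-4]  ; quadrant 2: table[32767-i] (mirrored)
    ret
Q3: fld     real4 ptr [SSE2SinTableSP-(16384*2*4)+eax*4]   ; quadrant 3: -table[i-32768]
    fchs
    ret
Q4: mov     edx,16384*4
    sub     edx,eax
    fld     real4 ptr [SSE2SinTableSP+edx*4-4]  ; quadrant 4: -table[65535-i] (mirrored)
    fchs
    ret
FPUsin_Lut endp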

EDIT: Noticed that it is much faster in quadrant 1 and quadrant 2, didn't know "fchs" is such an expensive one. ( see if we can do without it...... )
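
For readers who want to follow the index arithmetic of FPUsin_Lut without an assembler, here is a rough Python model of it. It is a sketch under assumptions, not the posted routine: SIN_TABLE is a hypothetical stand-in for SSE2SinTableSP (assumed to hold 16384 samples of sin(i * 2*pi/65535)), lut_sin is a made-up name, and Python's round() (round half to even) stands in for fistp under the default rounding mode.

```python
import math

TWO_PI = 2.0 * math.pi
SCALE = 65535.0 / TWO_PI      # 10430.2191955..., the scale constant used in the routine
QUARTER = 16384               # table entries per quadrant

# Hypothetical stand-in for SSE2SinTableSP (assumption: one quadrant of sine samples).
SIN_TABLE = [math.sin(i / SCALE) for i in range(QUARTER)]

def lut_sin(x):
    """Model of the quadrant-folded table lookup (illustrative only)."""
    n = round(x / TWO_PI)             # fistp: whole turns, round to nearest
    r = x - n * TWO_PI                # reduced angle, roughly -pi..pi
    idx = round(r * SCALE)            # fistp: table index, may be negative
    if idx < 0:                       # cmp eax,-1 / jg greater / sub eax,1
        idx -= 1
    idx &= 0xFFFF                     # and eax,65535
    quadrant = idx >> 14              # shr edx,14
    if quadrant == 0:
        return SIN_TABLE[idx]
    if quadrant == 1:
        return SIN_TABLE[2 * QUARTER - idx - 1]
    if quadrant == 2:
        return -SIN_TABLE[idx - 2 * QUARTER]
    return -SIN_TABLE[4 * QUARTER - idx - 1]
```

Under these assumptions lut_sin(0.2) comes out at about 0.198665, close to the table-based values printed by the test program later in the thread, and lut_sin(-0.2) is exactly -lut_sin(0.2), which is what the extra "sub eax,1" for negative indices is there to achieve.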

rrr314159 · April 03, 2015, 11:31:00 AM

While I was typing, your last post appeared; I'll take a look. Re. the previous post:

Quote from: Siekmanski on April 03, 2015, 10:00:24 AM
...("truncate down to -inf" rounding mode)...

- I guess the official name for it is "floor". "Truncate" usually means "truncate towards 0", and "ceiling" means "truncate towards +inf". But there's some confusion: in my school days "int" used to mean either floor or truncate depending on the mood ;) and "floor" and "ceiling" were never heard ... today, with the influence of computers, I think both those terms are accepted, and "int" is left with the meaning "towards 0". But we never called any of these "rounding" (which is "nearest integer"), much less "rounding modes". This potential confusion is why I always spell it out: "truncate towards ... -inf, zero, +inf, or nearest integer".
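
The four conversions named above only disagree on non-integers, and the difference between floor and truncate only shows up for negative inputs. A small Python sketch makes that concrete (integer_conversions is a made-up helper name; note that Python's round() resolves exact halves to the even neighbour):

```python
import math

def integer_conversions(x):
    """Return the four float-to-integer conversions discussed above."""
    return {
        "floor": math.floor(x),    # towards -inf
        "trunc": math.trunc(x),    # towards zero
        "ceil": math.ceil(x),      # towards +inf
        "nearest": round(x),       # nearest integer, ties to even
    }

if __name__ == "__main__":
    for value in (2.7, -2.7, -2.5):
        print(value, integer_conversions(value))
```

For 2.7, floor and truncate agree (both give 2); for -2.7, floor gives -3 while truncate gives -2, which is the case that trips up table-index code written for one mode and run under another.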

Quote from: Siekmanski on April 03, 2015, 10:00:24 AM
I'm just a simple Dutch guy, you know.

- Yeh right - Gerard 't Hooft is (or, was?) Dutch! Maybe Gunther understands renormalizing spontaneously broken non-abelian gauge theories but not many do, and 't Hooft invented it at age 19 (or whatever). Not to mention Huygens, Kuiper, Oort ... and a personal favorite of mine, not such a big name, Bart Bok, the only guy who understood the Milky Way - at least, the only one who could communicate that understanding. You need a better excuse!

Not that you need an excuse, I'm still baffled by 3d d3d9 ... intend to get around to it someday, thanks for that library

jj2007 · April 03, 2015, 11:32:54 AM

Quote from: Siekmanski on April 03, 2015, 10:59:49 AM
didn't know "fchs" is such an expensive one.

Shouldn't be more than 2 cycles:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2023    cycles for 1000 * fchs
2953    cycles for 1000 * 0-x

2022    cycles for 1000 * fchs
2952    cycles for 1000 * 0-x

2023    cycles for 1000 * fchs
2958    cycles for 1000 * 0-x

2       bytes for fchs
4       bytes for 0-x

Siekmanski · April 03, 2015, 12:21:33 PM

Hi rrr314159,

Quote
I'm still baffled by 3d d3d9 ... intend to get around to it someday, thanks for that library.

You're welcome, we could use some fast trigonometry functions for that d3d lib.

Hi Jochen,

You are right.
Did replace fchs with this piece of code and it didn't speed things up.

fstp real4 ptr [esp-4]
xor real4 ptr [esp-4],80000000h
fld real4 ptr [esp-4]
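
What the three instructions above do to the stored value can be illustrated in Python: flipping bit 31 of an IEEE-754 single negates it. This is only a sketch of the bit manipulation (flip_sign_bit is a made-up name), and it says nothing about the timing question discussed here.

```python
import struct

def flip_sign_bit(x):
    """Negate a single-precision value by XOR-ing its sign bit (bit 31)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))   # float -> raw 32 bits
    bits ^= 0x80000000                                    # toggle the sign bit
    (result,) = struct.unpack("<f", struct.pack("<I", bits))
    return result

if __name__ == "__main__":
    print(flip_sign_bit(1.5), flip_sign_bit(-0.25))
```

The mask touches only the sign bit, so exponent and mantissa are unchanged and zero simply becomes negative zero.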

rrr314159 · April 03, 2015, 01:52:41 PM

Quote from: Siekmanski
Did a test with this fpu routine, it's slightly faster than the SSE2 routine on my PC

- That's what I'm finding also. Isn't the slight advantage only because your inputs are already in the FPU, so for SSE you must convert them? If the inputs are originally prepared in an SSE register, I think the advantage then goes the other way.

And changing sign is faster for SSE, because you can use the xor 80000000h trick without converting, as you must with the FPU. So maybe SSE is the better way to go (although the FPU still has that 80-bit intermediate calculation advantage).

jj2007 · April 03, 2015, 06:09:12 PM

Quote from: Siekmanski on April 03, 2015, 12:21:33 PM
Did replace fchs with this piece of code and it didn't speed things up.

Same for neg:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2034    cycles for 1000 * fchs
2979    cycles for 1000 * 0-x
13219   cycles for 1000 * xor
13187   cycles for 1000 * neg

2036    cycles for 1000 * fchs
2968    cycles for 1000 * 0-x
13214   cycles for 1000 * xor
13211   cycles for 1000 * neg

So it's not fchs that slows things down. Try "hiding" it between non-FPU instructions.

Siekmanski · April 03, 2015, 09:32:27 PM

Hi Jochen,

Quote
Try "hiding" it between non-FPU instructions.

No place to hide, it's at the very end of the routine.

Here are the timings for all routines.
Measuring the minimum and maximum time each routine takes. ( quadrant 1 and quadrant 4 )
One minor correction in the code. Changed cvttsd2si --> cvtsd2si (left it whilst testing.)

Conclusion: The fpu is faster than SSE ( at least on my PC )

I'm very curious of the results from you guys....

Code Select

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

176  cycles SSE2sin Quadrant 1
178  cycles SSE2sin Quadrant 4

131  cycles FPUsin Quadrant 1
176  cycles FPUsin Quadrant 4

159  cycles FPUsin from SSE to FPU to SSE Quadrant 1
215  cycles FPUsin from SSE to FPU to SSE Quadrant 4

187  cycles SSE2cos Quadrant 1
193  cycles SSE2cos Quadrant 4

147  cycles FPUcos Quadrant 1
185  cycles FPUcos Quadrant 4

178  cycles FPUcos from SSE to FPU to SSE Quadrant 1
216  cycles FPUcos from SSE to FPU to SSE Quadrant 4

183  cycles SSE2tan Quadrant 1
184  cycles SSE2tan Quadrant 2

132  cycles FPUtan Quadrant 1
148  cycles FPUtan Quadrant 2

157  cycles FPUtan from SSE to FPU to SSE Quadrant 1
181  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

7130 cycles fsin (FPU)
7130 cycles fcos (FPU)
7356 cycles ftan (FPU)

Press any key to continue...

Gunther · April 03, 2015, 09:46:55 PM

Hi Marinus,

Quote from: Siekmanski on April 03, 2015, 09:32:27 PM
I'm very curious of the results from you guys....

here they are:

Code Select


SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

162  cycles SSE2sin Quadrant 1
170  cycles SSE2sin Quadrant 4

117  cycles FPUsin Quadrant 1
155  cycles FPUsin Quadrant 4

139  cycles FPUsin from SSE to FPU to SSE Quadrant 1
185  cycles FPUsin from SSE to FPU to SSE Quadrant 4

165  cycles SSE2cos Quadrant 1
167  cycles SSE2cos Quadrant 4

130  cycles FPUcos Quadrant 1
163  cycles FPUcos Quadrant 4

155  cycles FPUcos from SSE to FPU to SSE Quadrant 1
191  cycles FPUcos from SSE to FPU to SSE Quadrant 4

163  cycles SSE2tan Quadrant 1
162  cycles SSE2tan Quadrant 2

117  cycles FPUtan Quadrant 1
130  cycles FPUtan Quadrant 2

138  cycles FPUtan from SSE to FPU to SSE Quadrant 1
157  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

6271 cycles fsin (FPU)
6272 cycles fcos (FPU)
6469 cycles ftan (FPU)

Press any key to continue...

Gunther

jj2007 · April 03, 2015, 09:47:58 PM

Core i5:

Code Select

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

146  cycles SSE2sin Quadrant 1
165  cycles SSE2sin Quadrant 4

117  cycles FPUsin Quadrant 1
154  cycles FPUsin Quadrant 4

139  cycles FPUsin from SSE to FPU to SSE Quadrant 1
181  cycles FPUsin from SSE to FPU to SSE Quadrant 4

151  cycles SSE2cos Quadrant 1
169  cycles SSE2cos Quadrant 4

132  cycles FPUcos Quadrant 1
160  cycles FPUcos Quadrant 4

157  cycles FPUcos from SSE to FPU to SSE Quadrant 1
185  cycles FPUcos from SSE to FPU to SSE Quadrant 4

152  cycles SSE2tan Quadrant 1
163  cycles SSE2tan Quadrant 2

116  cycles FPUtan Quadrant 1
131  cycles FPUtan Quadrant 2

139  cycles FPUtan from SSE to FPU to SSE Quadrant 1
158  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

5925 cycles fsin (FPU)
5921 cycles fcos (FPU)
6106 cycles ftan (FPU)

Siekmanski · April 03, 2015, 09:51:25 PM

Thank you Gunther and Jochen. :t

rrr314159 · April 03, 2015, 10:07:15 PM

Similar to everyone else except fsin, fcos, ftan. AMD's ftan more than 10X faster than Intel!

Code Select

AMD A6 1.8 GHz

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

215  cycles SSE2sin Quadrant 1
258  cycles SSE2sin Quadrant 4

163  cycles FPUsin Quadrant 1
224  cycles FPUsin Quadrant 4

187  cycles FPUsin from SSE to FPU to SSE Quadrant 1
261  cycles FPUsin from SSE to FPU to SSE Quadrant 4

218  cycles SSE2cos Quadrant 1
257  cycles SSE2cos Quadrant 4

162  cycles FPUcos Quadrant 1
238  cycles FPUcos Quadrant 4

202  cycles FPUcos from SSE to FPU to SSE Quadrant 1
269  cycles FPUcos from SSE to FPU to SSE Quadrant 4

218  cycles SSE2tan Quadrant 1
229  cycles SSE2tan Quadrant 2

162  cycles FPUtan Quadrant 1
218  cycles FPUtan Quadrant 2

189  cycles FPUtan from SSE to FPU to SSE Quadrant 1
239  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

874 cycles fsin (FPU)
907 cycles fcos (FPU)
560 cycles ftan (FPU)

Press any key to continue...

Siekmanski · April 03, 2015, 10:26:39 PM

Hi, rrr314159

Funny that ftan is faster than fsin and fcos on your machine.
Or it must be that fsincos is that fast, because I calculated ftan with fsincos and fdivp.
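
For reference, computing the tangent with fsincos followed by fdivp amounts to tan(x) = sin(x) / cos(x). A tiny Python sketch of that identity (tan_via_sincos is a made-up name; this mirrors the arithmetic only, not the timing):

```python
import math

def tan_via_sincos(x):
    """tan(x) computed as sin(x) / cos(x), like fsincos followed by fdivp."""
    s, c = math.sin(x), math.cos(x)   # fsincos yields both values
    return s / c                      # fdivp divides sine by cosine

if __name__ == "__main__":
    print(tan_via_sincos(0.2))
```

For x = 0.2 this gives roughly 0.2027100, in line with the ftan value shown in the posted test output.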

dedndave · April 03, 2015, 11:35:05 PM

P4 Prescott w/HTT

Code Select

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

591  cycles SSE2sin Quadrant 1
577  cycles SSE2sin Quadrant 4

387  cycles FPUsin Quadrant 1
596  cycles FPUsin Quadrant 4

624  cycles FPUsin from SSE to FPU to SSE Quadrant 1
705  cycles FPUsin from SSE to FPU to SSE Quadrant 4

549  cycles SSE2cos Quadrant 1
931  cycles SSE2cos Quadrant 4

543  cycles FPUcos Quadrant 1
604  cycles FPUcos Quadrant 4

662  cycles FPUcos from SSE to FPU to SSE Quadrant 1
719  cycles FPUcos from SSE to FPU to SSE Quadrant 4

592  cycles SSE2tan Quadrant 1
604  cycles SSE2tan Quadrant 2

387  cycles FPUtan Quadrant 1
536  cycles FPUtan Quadrant 2

623  cycles FPUtan from SSE to FPU to SSE Quadrant 1
662  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

20852 cycles fsin (FPU)
20837 cycles fcos (FPU)
20332 cycles ftan (FPU)

Siekmanski · April 03, 2015, 11:48:26 PM

Thanks Dave :t

The MASM Forum
