Atan2 SSE2

jj2007 · March 02, 2022, 09:54:08 AM

Quote from: guga on March 02, 2022, 09:09:07 AM
Try this:

sin(x)+cos(x)

  FastMath SinPlusCos
	For_ ct=-1 To 360	; max 10,000 iterations
		fild ct	; X
		fld st
		fstp REAL10 ptr [edi]
		fmul FP8(0.01745329252)	; degrees->rad
		fsincos
		fadd
		fstp REAL10 ptr [edi+REAL10]
		add edi, 2*REAL10
	Next 
  FastMath

Code Select

NameA equ SinPlusCos FastMath
TestA proc
  mov ebx, AlgoLoops-1	; loop e.g. 100x
  xor ecx, ecx
  align 4
  .Repeat
	void SinPlusCos(ecx)	; uses FastMath
	fstp res8
	inc ecx
  .Until ecx>=360
  ret
TestA endp
TestA_endp:

align_64
TestB_s:
NameB equ SinPlusCos Fpu
TestB proc
  mov ebx, AlgoLoops-1	; loop e.g. 100x
  xor ecx, ecx
  align 4
  .Repeat
	push ecx
	fild stack
	pop ecx
	fmul FP8(0.01745329252)	; degrees->rad
	fsincos	; uses the FPU
	fadd
	fstp res8
	inc ecx
  .Until ecx>=360
  ret
TestB endp

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6745    cycles for 100 * SinPlusCos FastMath
39833   cycles for 100 * SinPlusCos Fpu

6797    cycles for 100 * SinPlusCos FastMath
39854   cycles for 100 * SinPlusCos Fpu

6741    cycles for 100 * SinPlusCos FastMath
39812   cycles for 100 * SinPlusCos Fpu

6745    cycles for 100 * SinPlusCos FastMath
39815   cycles for 100 * SinPlusCos Fpu

186     bytes for SinPlusCos FastMath
27      bytes for SinPlusCos Fpu

Real8   0.9823952887398167411   SinPlusCos FastMath
Real8   0.9823952887398167411   SinPlusCos Fpu
exact   0.9823952887191077263

Now the question is, of course, how fast and how accurate is LolRemez

guga · March 02, 2022, 10:08:36 AM

Great work. But still uses a precalculate table, right ?

About Lolremez, i´m not sure how accurate it is. I didn´t tested the results on the example i gave. For what i saw so far, the accuracy depends on the amount of polynomials to be calculated. For example, using a 4th degree polyomial is less accurate than using a 9th degree. If we could build a routine that can be able to do the maths of those equations using the polynomials approximations (ChebyChev, LolRemez, MinMax etc etc), we can achieve a higher level of accuracy using more polynomials (and yet, this shouldn´t affect performance since we can do it in SSE2 using few instructions, without any loops).

This library seems to have a more accurate results, but didn´t tested yet.
https://www.boost.org/doc/libs/1_37_0/libs/math/doc/sf_and_dist/html/math_toolkit/using_udt/use_ntl.html

Or this one..seems to be even more accurated

https://www.boost.org/doc/libs/1_37_0/libs/math/doc/sf_and_dist/html/math_toolkit/backgrounders/lanczos.html

jj2007 · March 02, 2022, 10:19:45 AM

I thought you had the FreeBasic DLL for testing LolRemez...

It would be fun to add it to the SinPlusCos.zip testbed

guga · March 02, 2022, 10:21:27 AM

Quote from: jj2007 on March 02, 2022, 10:19:45 AM
I thought you had the FreeBasic DLL for testing LolRemez...

It would be fun to add it to the testbed

Yes, this is what i´m trying to do. Im quite finishing rebuilding Fb LolRemez to test. The main problem is that i got stuck today on a routine called "GaussJordan". I´m trying to fix it before go further on the porting.

jj2007 · March 02, 2022, 10:50:32 AM

I've tested it now, see above, SinPlusCosV2.zip

It's fast but not very accurate. I wonder how fast it is with the same accuracy as the fpu or FastMath. Also, -1 ... +1 is not the full 0...360 degrees range.

jack · March 02, 2022, 11:17:25 AM

guga, if you are interested I can post a rational minimax approximation of arctan in the interval -1 to 1 with max error 1.34e-16

guga · March 02, 2022, 11:44:19 AM

Quote from: jj2007 on March 02, 2022, 10:50:32 AM
I've tested it now, see above, SinPlusCosV2.zip

It's fast but not very accurate. I wonder how fast it is with the same accuracy as the fpu or FastMath. Also, -1 ... +1 is not the full 0...360 degrees range.

Hi JJ. I saw it. Indeed, it´s very fast, but lack precision (probably because it is using a 4th degree polynomial and not 9th degree). I´ll make further tests on it to se if Fb version is more accurate and how can we improve it more.

guga · March 02, 2022, 11:48:18 AM

Quote from: jack on March 02, 2022, 11:17:25 AM
guga, if you are interested I can post a rational minimax approximation of arctan in the interval -1 to 1 with max error 1.34e-16

Hi Jack, yes, please. :) Also, post how you achieved such result :) It would be nice trying o implement a app using better approximation technique.

guga · March 02, 2022, 12:45:42 PM

Hi JJ

try with these, please:

Code Select

double f(double x)
{
    double u = 2.681518169196906e-6;
    u = u * x + 2.4121467561738585e-5;
    u = u * x + -1.9833112700127753e-4;
    u = u * x + -1.3882965671023472e-3;
    u = u * x + 8.3332934231267097e-3;
    u = u * x + 4.1666455708179865e-2;
    u = u * x + -1.6666665852866353e-1;
    u = u * x + -4.9999997368823986e-1;
    u = u * x + 9.9999999952226541e-1;
    return u * x + 9.999999994749521e-1;
}

I tried to insert it on masmbasic, but an error ocurred

jj2007 · March 02, 2022, 02:07:11 PM

Hi Gustavo,
Here are the results:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2241    cycles for 100 * SinPlusCos FastMath
12007   cycles for 100 * SinPlusCos Fpu
950     cycles for 100 * LolRemez 5
2329    cycles for 100 * LolRemez 8

2106    cycles for 100 * SinPlusCos FastMath
11833   cycles for 100 * SinPlusCos Fpu
950     cycles for 100 * LolRemez 5
2329    cycles for 100 * LolRemez 8

2108    cycles for 100 * SinPlusCos FastMath
11833   cycles for 100 * SinPlusCos Fpu
949     cycles for 100 * LolRemez 5
2330    cycles for 100 * LolRemez 8

2106    cycles for 100 * SinPlusCos FastMath
11832   cycles for 100 * SinPlusCos Fpu
950     cycles for 100 * LolRemez 5
2330    cycles for 100 * LolRemez 8

183     bytes for SinPlusCos FastMath
24      bytes for SinPlusCos Fpu
72      bytes for LolRemez 5
124     bytes for LolRemez 8

Real8   1.383309602960451023    SinPlusCos FastMath
Real8   1.383309602960451023    SinPlusCos Fpu
Real8   1.382865190370866637    LolRemez 5
Real8   0.3833096037593586303   LolRemez 8
exact   1.383309602960451112

There is one tiny bit wrong, it seems. With 8 coefficients, it becomes difficult to keep track... see TestD

LolRemez 8 is slightly slower than FastMath, and it is very limited in its range. You have chosen -1 ... +1, which translates to degrees would be -57...58 degrees. Can you give me the coefficients for -PI ... +PI? That is FastMath's range in the attached version 3 (but only -57...+58 are used, in order to yield the same result as LolRemez).

Re precision: FastMath yields 16 digits of precision, compared to the value that I get with the Windows Calculator.

guga · March 02, 2022, 04:50:01 PM

That´s weird.

I thought that giving a range of -1 to 1 mean s that we are dealing with -Pi to Pi covering the full range, but using normalized values. I´ll take a look later to see how exactly this lolremez works
https://github.com/samhocevar/lolremez

I found some other interesting articles about LolRemez tests here:
http://a-hackers-craic.blogspot.com/2014/07/a-better-faster-sincoscexpi.html
and
https://github.com/samhocevar/lolremez
https://asmbits.blogspot.com/2016/03/vector-sin-and-cos-sse2.html

jack · March 02, 2022, 09:44:48 PM

guga, sorry for being late but just in case you want to give these a try

Code Select


(x *(8883.0849736588075911 + 
     x^2 *(20222.7394253468435665 + 
        x^2 *(16159.2278648857071736 + 
           x^2 *(5387.4278783915353868 + 
              x^2 *(686.94856409857219691 + 
                 22.393048501019117472 *
x^2))))))/(8883.0849736588087835 + 
   x^2 *(23183.7677498992921884 + 
      x^2 *(22110.5334534862024463 + 
         x^2 *(9389.8642844889571880 + 
            x^2 *(1719.75402628941589287 + 
               x^2 *(107.898060132482141870 + x^2))))))

I used Mathematica with the following steps
first load the package

Code Select


<< FunctionApproximations`

then approximate the function

Code Select


f = MiniMaxApproximation[ ArcTan[Sqrt[x]]/Sqrt[x], {x, {1/10^50, 1}, 5, 6},  WorkingPrecision -> 80, Brake -> {500, 5}]

and extract the polynomial from the resulting list

Code Select


f = f[[2, 1]]

the reason for approximating Arctan(Sqrt(x))/Sqrt(x) is to eliminate the odd polynomial so instead of
x - x^3/3 + x^5/5 ... you have 1 - x/3 + x^2/5 - x^3/7 and so the approximating polynomial won't have even powers of x with small coefficients but now we need to convert the polynomial obtained back to odd and at the same time reduce the digits of the polynomial

Code Select


f = N[(f /. x -> x*x)*x, 22]

and convert the polynomial to HornerForm

Code Select


HornerForm[f]

jack · March 02, 2022, 11:11:26 PM

guga, you could reduce the argument as follows, if x=Pi*2 then arctan(x) = 2*arctan(x/(1+sqrt(1+x^2))) so instead of arctan(6.28...) you calculate 2 * arctan(.853...)
reducing the interval to less than 0..1 for the minimax approximation doesn't payoff

jack · March 03, 2022, 12:11:38 AM

if you reduce the argument x as follows ( x in the range -Pi*2 .. Pi*2 )

Code Select


for i=1 to 5
    x=x/(1+sqrt(1+x*x))
next i

then you could calculate the arctan using this polynomial

Code Select


y = 32*((x * (202.153477236938512481 + 
   x^2 * (305.19168618485308633 + 
      x^2 * (127.474551705012762179 + 
         12.6747296167999257198 * x^2))))/(202.153477236938513333 + 
 x^2 * (372.576178597164761655 + 
    x^2 * (211.235915790272755720 + 
       x^2 * (37.4505339820668783381 + x^2)))))

or you could simply use the power series

Code Select


y = 32*(x-x^3/3+x^5/5-x^7/7+x^9/9-x^11/11)

or in HornerForm

Code Select


y = 32*(x*(1+x^2*(-(1/3)+x^2*(1/5+x^2*(-(1/7)+x^2*(1/9-x^2/11))))))

or

Code Select


y = x *(32 + x^2*(-(32/3) + x^2*(32/5 + x^2*(-(32/7) + x^2*(32/9 - (32*x^2)/11)))))

jj2007 · March 03, 2022, 02:08:57 AM

Quote from: guga on March 02, 2022, 04:50:01 PMI thought that giving a range of -1 to 1 mean s that we are dealing with -Pi to Pi covering the full range, but using normalized values.

Nope. If they were normalised, I would not get the correct values...

New version for sin+cos, this is now the code to calculate y(x):

Code Select

	PushPoly ecx*0.01745329251994329577,\
		3.9297499716796932e-2,\
		-1.5653249505225839e-1,\
		-4.9891262416895649e-1,\
		9.9750048049840423e-1,\
		9.9991743032029927e-1
	fstp res8

The *0.017 translates degrees to rad. The two LolRemez functions operate over the limit range of -57...+58 degress. If you give me the coefficients for the full -180...+180 or 0...360 degrees range, I can adapt them accordingly.

Timings:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2256    cycles for 100 * SinPlusCos FastMath
11833   cycles for 100 * SinPlusCos Fpu
1308    cycles for 100 * LolRemez 5
3258    cycles for 100 * LolRemez 8

2106    cycles for 100 * SinPlusCos FastMath
12328   cycles for 100 * SinPlusCos Fpu
1309    cycles for 100 * LolRemez 5
3177    cycles for 100 * LolRemez 8

183     bytes for SinPlusCos FastMath
24      bytes for SinPlusCos Fpu
72      bytes for LolRemez 5
122     bytes for LolRemez 8

Real8   1.383309602960451023    SinPlusCos FastMath
Real8   1.383309602960451023    SinPlusCos Fpu
Real8   1.382865190370866637    LolRemez 5
Real8   1.383309603234310847    LolRemez 8
exact   1.383309602960451112

The MASM Forum

News:

Atan2 SSE2

jj2007

guga

jj2007

guga

jj2007

jack

guga

guga

guga

jj2007

guga

jack

jack

jack

jj2007