News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Atan2 SSE2

Started by guga, February 07, 2022, 08:27:40 AM

Previous topic - Next topic

jj2007

Quote from: guga on March 02, 2022, 09:09:07 AM
Try this:

sin(x)+cos(x)

  FastMath SinPlusCos
For_ ct=-1 To 360 ; max 10,000 iterations
fild ct ; X
fld st
fstp REAL10 ptr [edi]
fmul FP8(0.01745329252) ; degrees->rad
fsincos
fadd
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
  FastMath


NameA equ SinPlusCos FastMath
TestA proc
  mov ebx, AlgoLoops-1 ; loop e.g. 100x
  xor ecx, ecx
  align 4
  .Repeat
void SinPlusCos(ecx) ; uses FastMath
fstp res8
inc ecx
  .Until ecx>=360
  ret
TestA endp
TestA_endp:

align_64
TestB_s:
NameB equ SinPlusCos Fpu
TestB proc
  mov ebx, AlgoLoops-1 ; loop e.g. 100x
  xor ecx, ecx
  align 4
  .Repeat
push ecx
fild stack
pop ecx
fmul FP8(0.01745329252) ; degrees->rad
fsincos ; uses the FPU
fadd
fstp res8
inc ecx
  .Until ecx>=360
  ret
TestB endp


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6745    cycles for 100 * SinPlusCos FastMath
39833   cycles for 100 * SinPlusCos Fpu

6797    cycles for 100 * SinPlusCos FastMath
39854   cycles for 100 * SinPlusCos Fpu

6741    cycles for 100 * SinPlusCos FastMath
39812   cycles for 100 * SinPlusCos Fpu

6745    cycles for 100 * SinPlusCos FastMath
39815   cycles for 100 * SinPlusCos Fpu

186     bytes for SinPlusCos FastMath
27      bytes for SinPlusCos Fpu

Real8   0.9823952887398167411   SinPlusCos FastMath
Real8   0.9823952887398167411   SinPlusCos Fpu
exact   0.9823952887191077263


Now the question is, of course, how fast and how accurate is LolRemez :rolleyes:

guga

Great work. But still uses a precalculate table, right ?

About Lolremez, i´m not sure how accurate it is. I didn´t tested the results on the example i gave. For what i saw so far, the accuracy depends on the amount of polynomials to be calculated. For example, using a 4th degree polyomial is less accurate than using a 9th degree. If we could build a routine that can be able to do the maths of those equations using the polynomials approximations (ChebyChev, LolRemez, MinMax etc etc), we can achieve a higher level of accuracy using more polynomials (and yet, this shouldn´t affect performance since we can do it in SSE2 using few instructions, without any loops).

This library seems to have a more accurate results, but didn´t tested yet.
https://www.boost.org/doc/libs/1_37_0/libs/math/doc/sf_and_dist/html/math_toolkit/using_udt/use_ntl.html

Or this one..seems to be even more accurated

https://www.boost.org/doc/libs/1_37_0/libs/math/doc/sf_and_dist/html/math_toolkit/backgrounders/lanczos.html
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

I thought you had the FreeBasic DLL for testing LolRemez... :cool:

It would be fun to add it to the SinPlusCos.zip testbed

guga

Quote from: jj2007 on March 02, 2022, 10:19:45 AM
I thought you had the FreeBasic DLL for testing LolRemez... :cool:

It would be fun to add it to the testbed

Yes, this is what i´m trying to do. Im quite finishing rebuilding Fb LolRemez to test. The main problem is that i got stuck today on a routine called "GaussJordan". I´m trying to fix it before go further on the porting.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

I've tested it now, see above, SinPlusCosV2.zip

It's fast but not very accurate. I wonder how fast it is with the same accuracy as the fpu or FastMath. Also, -1 ... +1 is not the full 0...360 degrees range.

jack

guga, if you are interested I can post a rational minimax approximation of arctan in the interval -1 to 1 with max error 1.34e-16

guga

Quote from: jj2007 on March 02, 2022, 10:50:32 AM
I've tested it now, see above, SinPlusCosV2.zip

It's fast but not very accurate. I wonder how fast it is with the same accuracy as the fpu or FastMath. Also, -1 ... +1 is not the full 0...360 degrees range.

Hi JJ. I saw it. Indeed, it´s very fast, but lack precision (probably because it is using a 4th degree polynomial and not 9th degree). I´ll make further tests on it to se if Fb version is more accurate and how can we improve it more.

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Quote from: jack on March 02, 2022, 11:17:25 AM
guga, if you are interested I can post a rational minimax approximation of arctan in the interval -1 to 1 with max error 1.34e-16

Hi Jack, yes, please. :) Also, post how you achieved such result :) It would be nice trying o implement a app using better approximation technique.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Hi JJ

try with these, please:

double f(double x)
{
    double u = 2.681518169196906e-6;
    u = u * x + 2.4121467561738585e-5;
    u = u * x + -1.9833112700127753e-4;
    u = u * x + -1.3882965671023472e-3;
    u = u * x + 8.3332934231267097e-3;
    u = u * x + 4.1666455708179865e-2;
    u = u * x + -1.6666665852866353e-1;
    u = u * x + -4.9999997368823986e-1;
    u = u * x + 9.9999999952226541e-1;
    return u * x + 9.999999994749521e-1;
}


I tried to insert it on masmbasic, but an error ocurred
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Hi Gustavo,
Here are the results:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2241    cycles for 100 * SinPlusCos FastMath
12007   cycles for 100 * SinPlusCos Fpu
950     cycles for 100 * LolRemez 5
2329    cycles for 100 * LolRemez 8

2106    cycles for 100 * SinPlusCos FastMath
11833   cycles for 100 * SinPlusCos Fpu
950     cycles for 100 * LolRemez 5
2329    cycles for 100 * LolRemez 8

2108    cycles for 100 * SinPlusCos FastMath
11833   cycles for 100 * SinPlusCos Fpu
949     cycles for 100 * LolRemez 5
2330    cycles for 100 * LolRemez 8

2106    cycles for 100 * SinPlusCos FastMath
11832   cycles for 100 * SinPlusCos Fpu
950     cycles for 100 * LolRemez 5
2330    cycles for 100 * LolRemez 8

183     bytes for SinPlusCos FastMath
24      bytes for SinPlusCos Fpu
72      bytes for LolRemez 5
124     bytes for LolRemez 8

Real8   1.383309602960451023    SinPlusCos FastMath
Real8   1.383309602960451023    SinPlusCos Fpu
Real8   1.382865190370866637    LolRemez 5
Real8   0.3833096037593586303   LolRemez 8
exact   1.383309602960451112


There is one tiny bit wrong, it seems. With 8 coefficients, it becomes difficult to keep track... see TestD :sad:
LolRemez 8 is slightly slower than FastMath, and it is very limited in its range. You have chosen -1 ... +1, which translates to degrees would be -57...58 degrees. Can you give me the coefficients for -PI ... +PI? That is FastMath's range in the attached version 3 (but only -57...+58 are used, in order to yield the same result as LolRemez).

Re precision: FastMath yields 16 digits of precision, compared to the value that I get with the Windows Calculator.

guga

That´s weird.

I thought that giving a range of -1 to 1 mean s that we are dealing with -Pi to Pi covering the full range, but using normalized values. I´ll take a look later to see how exactly this lolremez works
https://github.com/samhocevar/lolremez

I found some other interesting articles about LolRemez tests here:
http://a-hackers-craic.blogspot.com/2014/07/a-better-faster-sincoscexpi.html
and
https://github.com/samhocevar/lolremez
https://asmbits.blogspot.com/2016/03/vector-sin-and-cos-sse2.html
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jack

#56
guga, sorry for being late but just in case you want to give these a try

(x *(8883.0849736588075911 +
     x^2 *(20222.7394253468435665 +
        x^2 *(16159.2278648857071736 +
           x^2 *(5387.4278783915353868 +
              x^2 *(686.94856409857219691 +
                 22.393048501019117472 *
x^2))))))/(8883.0849736588087835 +
   x^2 *(23183.7677498992921884 +
      x^2 *(22110.5334534862024463 +
         x^2 *(9389.8642844889571880 +
            x^2 *(1719.75402628941589287 +
               x^2 *(107.898060132482141870 + x^2))))))

I used Mathematica with the following steps
first load the package

<< FunctionApproximations`

then approximate the function

f = MiniMaxApproximation[ ArcTan[Sqrt[x]]/Sqrt[x], {x, {1/10^50, 1}, 5, 6},  WorkingPrecision -> 80, Brake -> {500, 5}]

and extract the polynomial from the resulting list

f = f[[2, 1]]

the reason for approximating Arctan(Sqrt(x))/Sqrt(x) is to eliminate the odd polynomial so instead of
x - x^3/3 + x^5/5 ... you have 1 - x/3 + x^2/5 - x^3/7 and so the approximating polynomial won't have even powers of x with small coefficients but now we need to convert the polynomial obtained back to odd and at the same time reduce the digits of the polynomial

f = N[(f /. x -> x*x)*x, 22]

and convert the polynomial to HornerForm

HornerForm[f]


jack

#57
guga, you could reduce the argument as follows, if x=Pi*2 then arctan(x) = 2*arctan(x/(1+sqrt(1+x^2))) so instead of arctan(6.28...) you calculate 2 * arctan(.853...)
reducing the interval to less than 0..1 for the minimax approximation doesn't payoff

jack

#58
if you reduce the argument x as follows ( x in the range -Pi*2 .. Pi*2 )

for i=1 to 5
    x=x/(1+sqrt(1+x*x))
next i

then you could calculate the arctan using this polynomial

y = 32*((x * (202.153477236938512481 +
   x^2 * (305.19168618485308633 +
      x^2 * (127.474551705012762179 +
         12.6747296167999257198 * x^2))))/(202.153477236938513333 +
x^2 * (372.576178597164761655 +
    x^2 * (211.235915790272755720 +
       x^2 * (37.4505339820668783381 + x^2)))))

or you could simply use the power series

y = 32*(x-x^3/3+x^5/5-x^7/7+x^9/9-x^11/11)

or in HornerForm

y = 32*(x*(1+x^2*(-(1/3)+x^2*(1/5+x^2*(-(1/7)+x^2*(1/9-x^2/11))))))

or

y = x *(32 + x^2*(-(32/3) + x^2*(32/5 + x^2*(-(32/7) + x^2*(32/9 - (32*x^2)/11)))))

jj2007

Quote from: guga on March 02, 2022, 04:50:01 PMI thought that giving a range of -1 to 1 mean s that we are dealing with -Pi to Pi covering the full range, but using normalized values.

Nope. If they were normalised, I would not get the correct values...

New version for sin+cos, this is now the code to calculate y(x):

PushPoly ecx*0.01745329251994329577,\
3.9297499716796932e-2,\
-1.5653249505225839e-1,\
-4.9891262416895649e-1,\
9.9750048049840423e-1,\
9.9991743032029927e-1
fstp res8


The *0.017 translates degrees to rad. The two LolRemez functions operate over the limit range of -57...+58 degress. If you give me the coefficients for the full -180...+180 or 0...360 degrees range, I can adapt them accordingly.

Timings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2256    cycles for 100 * SinPlusCos FastMath
11833   cycles for 100 * SinPlusCos Fpu
1308    cycles for 100 * LolRemez 5
3258    cycles for 100 * LolRemez 8

2106    cycles for 100 * SinPlusCos FastMath
12328   cycles for 100 * SinPlusCos Fpu
1309    cycles for 100 * LolRemez 5
3177    cycles for 100 * LolRemez 8

183     bytes for SinPlusCos FastMath
24      bytes for SinPlusCos Fpu
72      bytes for LolRemez 5
122     bytes for LolRemez 8

Real8   1.383309602960451023    SinPlusCos FastMath
Real8   1.383309602960451023    SinPlusCos Fpu
Real8   1.382865190370866637    LolRemez 5
Real8   1.383309603234310847    LolRemez 8
exact   1.383309602960451112