The MASM Forum

General => The Laboratory => Topic started by: guga on February 07, 2022, 08:27:40 AM

Title: Atan2 SSE2
Post by: guga on February 07, 2022, 08:27:40 AM
Hi Guys

Someone succeeded to create a atan2 function similar (or even better) to the ones existent in Intel ?

https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-short-vector-math-library-ops/intrinsics-for-trigonometric-operations/mm-atan2-pd-mm256-atan2-pd.html
Title: Re: Atan2 SSE2
Post by: jj2007 on February 07, 2022, 09:54:42 AM
Have you seen How to Find a Fast Floating-Point atan2 Approximation (https://www.dsprelated.com/showarticle/1052.php)?

If your problem is to find atan2(X,Y), and either X or Y is a constant, then FastMath (http://masm32.com/board/index.php?topic=8779.0) would be an option.

Here is one that I don't understand (https://gist.github.com/tobin/2028637) :sad:
Title: Re: Atan2 SSE2
Post by: guga on February 07, 2022, 11:51:36 AM
Hi JJ. I´ll give a try.

Also, ucrtbase.dll have a atan2 function using SSE2 ( ___libm_sse2_atan2), but i didn´t tested yet for speed. I wonder if there is a faster way to do it. For the normal way using FPU, i did some routine years ago that can also convert to Hue angles. The problem is that it needs to be fst, and i presume the FPU way is not fast enough, specially when i plan to use it with Marinus FFT routines for audio.


[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]

Proc atan2:
    Arguments @pY, @pX, @ConvDegree
    Structure @TempStorage 16, @TmpDataDis 0
    Uses eax, ebx, ecx

    mov ebx D@pY
    mov ecx D@pX

    fld R$ebx
    fld R$ecx
    fpatan
    fstsw ax
    wait
    shr ax 1
    jnb L2>
        fclex | stc | xor eax eax | ExitP
L2:

    .If D@ConvDegree = &TRUE
        fmul R$Float_AtanPiFactor | fst R@TmpDataDis
        Fpu_If R@TmpDataDis < R$FloatZero
            fadd R$Float360
        Fpu_Else_If R@TmpDataDis >= R$Float360
            fsub R$Float360
        Fpu_End_If
    .End_If

    clc
    mov eax &TRUE

EndP
Title: Re: Atan2 SSE2
Post by: guga on February 07, 2022, 02:50:29 PM
Found something interesting using polynomials

https://mazzo.li/posts/vectorized-atan2.html
https://pub.dev/documentation/complex/latest/fastmath/atan.html
https://www.dsprelated.com/showarticle/1052.php
https://opensource.apple.com/source/Libm/Libm-315/Source/Intel/atan.c
https://stackoverflow.com/questions/11930594/calculate-atan2-without-std-functions-or-c99
https://www.examplefiles.net/cs/453132
Title: Re: Atan2 SSE2
Post by: guga on February 08, 2022, 11:10:30 PM
Hi guys

Can someone test the timmings for this please ?

I succeeded to rewrite a atan2 functions using SSE2 from ucrtbase.dll and wanted to know if it is really that fast as it seems (also compared to the original version as well). The function works the same as in __libm_sse2_atan2 from ucrtbase.dll. I optimized it a little bit only reorganizing the function. (The original version is a kind of a mess due to heavy spaghetti code)

I´m pretty sure it can be optimized further for speed, but i didn´t fully understood yet, how it works.

I built it on a dll just to make easier to the benchmark tests. The function is Sse2_atan2 in the test.dll. It doesn't contains any parameters.

To calculate the atan2 you need to place the values of x, y (in double) in the registers XMM0 and XMM1 (as it is on the original version).

In Rosasm syntax the function can be called as:

[MyValue1: R$ 0.554545]
[MyValue2: R$ 0.1]

movupd XMM0 X$MyValue1
movupd XMM1 X$MyValue2

call Sse2_atan2

The original version (and also mine) uses a table of tangents to calculate the atan2 of a value. The table consists of 164 pairs of Real8 values. The 1st one representing the tangent of a certain value and the next Real8 seems to be some error /difference of some sort.

For my tests, i excluded this error case and used only the true calculated values of the tangents.

This table seems to be generated on  a certain order. It consists of 164 values whose last starting from 0.02928849741073 to Pi/2. So it starts with the tangent of a certain value: Value at Pos1 = atan(0.029296875) to atan(32) (Value at Pos 163).  And the last value (Pos164) since it cannot be atan(pi/2), it restricted to  Pi/2. On each 16 values (from the last to the 1st - excepting the last one), it decreaes the value to be calculate by 1/2, starting with 1.

So, at Pos 163 = atan(32) ---> decreasing 1 on each loop until Pos 147
Pos 147 = atan(16) ---> decreasing the half of 1. So, decreasing 0.5 on the next loop untill Pos131
Pos131 = atan(8) ---> decreasing the half of 0.5. So, decreasing 0.25 on the next loop untill Pos115
etc


I built a function that reproduces the values of the original table (except the 2nd Real8 for error cases, which i filled only with 0 for now):



[MyTangTBl: R$ 0 #(164*2)]
[Float_PI: R$ 3.141592653589793238462643383279502884197169399375105820974944592307]

Proc CreateTangetTable:
    Local @Counter, @Value, @InternalCounter, @DecreaseRate
    Uses esi

    mov D@Counter 164
    mov D@InternalCounter 0
    fld1 | fstp F@DecreaseRate

    mov esi MyTangTBl
    add esi (164*2*8) | sub esi (8*2)
    fld R$Float_PI | fstp R$esi

    dec D@Counter
    sub esi (8*2)
    mov D@Value 33 | fild D@Value | fstp F@Value
    .Do

        fld D@Value | fsub F@DecreaseRate | fst F@Value | fld1 | fpatan | fstp R$esi
           
        If D@InternalCounter = 16
            fld F@DecreaseRate | fmul F$Float_half | fstp F@DecreaseRate
            mov D@InternalCounter 0
        End_If

        inc D@InternalCounter
        sub esi (8*2)
        dec D@Counter
    .Loop_Until D@Counter = 0
L1:
EndP


The original values of this table (including the error values - I didn´t included the error values on mine version, btw, but it don´t affect the precision too much) is :

DoubleData <0.02928849741073059, 3.878020342543118e-16>
DoubleData <0.03026419423862503, 2.13439966346682e-16>
DoubleData <0.03123983343026815, 1.27181094510696e-16>
DoubleData <0.03319093149711083, 7.623353902758895e-16>
DoubleData <0.03514177680279662, 1.622534112980482e-16>
DoubleData <0.03709235455039117, 6.433766265361664e-16>
DoubleData <0.03904264995516638, 6.181886837851646e-16>
DoubleData <0.04099264824526294, 8.420830387839661e-16>
DoubleData <0.04294233466236186, 3.142811786546985e-16>
DoubleData <0.04489169446234609, 4.096264921495231e-16>
DoubleData <0.04684071291596936, 2.912679762198781e-16>
DoubleData <0.04878937530951521, 4.053693341011583e-16>
DoubleData <0.05073766694545956, 6.612695435815108e-16>
DoubleData <0.05268557314312972, 3.289966758135281e-16>
DoubleData <0.05463307923935901, 4.691749851066725e-16>
DoubleData <0.05658017058914488, 8.289832696419427e-16>
DoubleData <0.05852683256630176, 1.362951662640667e-17>
DoubleData <0.06047305056410668, 6.37811249269131e-16>
DoubleData <0.06241880999595661, 7.339736781833367e-16>
DoubleData <0.06630889491982295, 5.407451321058335e-16>
DoubleData <0.07019697107187017, 3.451465030350394e-16>
DoubleData <0.07408292254903337, 3.609579312927062e-16>
DoubleData <0.0779666338315419, 4.082603982997626e-16>
DoubleData <0.08184798980307573, 8.205279419748868e-16>
DoubleData <0.08572687577074412, 6.992365845342258e-16>
DoubleData <0.08960317748487157, 1.793304156180366e-16>
DoubleData <0.09347678115858926, 2.018823445176747e-16>
DoubleData <0.09734757348722312, 5.576265782080219e-16>
DoubleData <0.101215441667466, 6.856928051415318e-16>
DoubleData <0.1050802734163288, 7.80192436556187e-16>
DoubleData <0.1089419569898658, 4.846007563068433e-17>
DoubleData <0.1128003812016587, 6.957566319383172e-16>
DoubleData <0.1166554354410687, 6.438661649715736e-16>
DoubleData <0.1205070096912237, 8.182569515647907e-16>
DoubleData <0.1243549945467608, 6.352529150170111e-16>
DoubleData <0.1320397616146387, 4.274189722415823e-17>
DoubleData <0.1397088742891635, 1.913310428846708e-16>
DoubleData <0.1473614810886508, 8.658324432321626e-16>
DoubleData <0.1549967419239406, 3.426523229816613e-16>
DoubleData <0.1626138285979479, 6.461627098025713e-16>
DoubleData <0.1702119252854741, 2.74014592076487e-16>
DoubleData <0.1777902289926754, 6.898598082898684e-16>
DoubleData <0.1853479499956947, 5.969184350010091e-17>
DoubleData <0.1928843122579744, 2.42385590364413e-16>
DoubleData <0.2003985538258783, 1.696734079809579e-16>
DoubleData <0.2078899272022623, 7.012225510572437e-16>
DoubleData <0.215357699697738, 5.59849672442657e-17>
DoubleData <0.2228011537593941, 3.830792364463579e-16>
DoubleData <0.2302195872768431, 6.229360680729788e-16>
DoubleData <0.2376123138654709, 3.71404797316887e-16>
DoubleData <0.2449786631268633, 8.156104484719729e-16>
DoubleData <0.2596296294082574, 1.30261057387131e-16>
DoubleData <0.2741674511196583, 4.523505634252264e-16>
DoubleData <0.2885873618940771, 3.187832078137744e-16>
DoubleData <0.3028848683749708, 6.551229868720925e-16>
DoubleData <0.3170557532091465, 5.361722230696518e-16>
DoubleData <0.3310960767041315, 5.471589019367845e-16>
DoubleData <0.345002177207105, 8.808349770693735e-17>
DoubleData <0.3587706702705722, 3.088733564861919e-17>
DoubleData <0.3723984466767538, 4.081903701236505e-16>
DoubleData <0.3858826693980735, 3.013439834812086e-16>
DoubleData <0.3992207695752521, 4.665551909062331e-16>
DoubleData <0.4124104415973866, 7.057684437286449e-16>
DoubleData <0.4254496373700416, 6.894493455169868e-16>
DoubleData <0.4383365598579569, 8.632356493938598e-16>
DoubleData <0.4510696559885234, 3.280735600183736e-17>
DoubleData <0.4636476090008053, 8.553660459218291e-16>
DoubleData <0.4883339510564051, 4.32715973660733e-16>
DoubleData <0.5123894603107368, 8.627156382272694e-16>
DoubleData <0.5358112379604636, 1.069585067790331e-16>
DoubleData <0.558599315343562, 4.38633579301471e-16>
DoubleData <0.5807563535676703, 9.660765868058498e-17>
DoubleData <0.6022873461349638, 3.62571214759831e-16>
DoubleData <0.6231993299340655, 4.708132487014636e-16>
DoubleData <0.6435011087932843, 1.268570875139599e-16>
DoubleData <0.6632029927060925, 7.463955685933131e-16>
DoubleData <0.6823165548747481, 6.943223671560008e-18>
DoubleData <0.7008544078844494, 7.572798548942514e-16>
DoubleData <0.7188299996216241, 4.226108214056057e-16>
DoubleData <0.7362574289814274, 7.008731912580885e-16>
DoubleData <0.7531512809621939, 5.308545776533963e-16>
DoubleData <0.7695264804056574, 8.51128500644098e-16>
DoubleData <0.7853981633974483, 3.061616997868383e-17>
DoubleData <0.8156919233162228, 6.554192491473065e-16>
DoubleData <0.8441539861131702, 8.397650495807761e-16>
DoubleData <0.8709034570756522, 7.544578813301367e-16>
DoubleData <0.8960553845713433, 6.95372577632837e-16>
DoubleData <0.9197196053504166, 1.814702107965036e-16>
DoubleData <0.9420000403794635, 1.656306773209825e-16>
DoubleData <0.9629943306809361, 1.070356418673049e-16>
DoubleData <0.9827937232473287, 3.46970218418778e-16>
DoubleData <1.001483135694234, 7.605168950105478e-16>
DoubleData <1.019141344266349, 6.761378336444607e-16>
DoubleData <1.0358412530088, 3.194313981784504e-17>
DoubleData <1.051650212548373, 5.696281674604188e-16>
DoubleData <1.066630365315743, 6.065679184034902e-16>
DoubleData <1.080839000541168, 2.063682824136722e-16>
DoubleData <1.09432890732119, 2.165539287700089e-16>
DoubleData <1.10714871779409, 9.404471373566379e-17>
DoubleData <1.13095374397916, 5.153275478954471e-16>
DoubleData <1.152571997215667, 1.304472198360275e-16>
DoubleData <1.172273881128476, 7.499857009153807e-16>
DoubleData <1.190289949682532, 7.683333629842069e-17>
DoubleData <1.206817370285252, 2.637692813136457e-16>
DoubleData <1.222025323210989, 6.363421861261654e-16>
DoubleData <1.236059489478081, 7.449313421696882e-16>
DoubleData <1.249045772398254, 8.859822159005129e-16>
DoubleData <1.26109338225244, 2.544660011403809e-16>
DoubleData <1.272297395208717, 2.245875015034507e-17>
DoubleData <1.28274087974427, 8.788952309458591e-16>
DoubleData <1.292496667789785, 1.537365572357647e-16>
DoubleData <1.301628834009196, 2.09675419926785e-16>
DoubleData <1.310193935047555, 7.535879521228967e-16>
DoubleData <1.318242051016837, 3.808952695386159e-16>
DoubleData <1.325817663668032, 1.3380031118552e-16>
DoubleData <1.339705659598999, 6.401436961720526e-16>
DoubleData <1.352127380920955, 2.147674250751151e-17>
DoubleData <1.363300100359694, 3.313692220777249e-16>
DoubleData <1.373400766945015, 6.330567112173988e-16>
DoubleData <1.382574821490126, 1.86429700538549e-16>
DoubleData <1.390942827002418, 7.897412983652368e-16>
DoubleData <1.398605512271957, 4.208485980241463e-16>
DoubleData <1.40564764938027, 1.328183035426864e-16>
DoubleData <1.412141064608495, 5.703957436695217e-16>
DoubleData <1.418146998399631, 8.055395818750151e-16>
DoubleData <1.423717971406494, 3.09263314147271e-16>
DoubleData <1.428899272190733, 1.574732574926438e-16>
DoubleData <1.433730152484708, 8.442163750324489e-16>
DoubleData <1.438244794498222, 6.412036156724483e-16>
DoubleData <1.442473099109101, 7.776664761570937e-16>
DoubleData <1.446441332248135, 3.141578446404818e-16>
DoubleData <1.453687582228032, 8.199907271986118e-16>
DoubleData <1.460139105621001, 2.846543520033397e-16>
DoubleData <1.465919388064663, 1.24932049384283e-16>
DoubleData <1.471127674303734, 7.79686192079511e-16>
DoubleData <1.47584462045214, 4.779648065777258e-16>
DoubleData <1.480136439594151, 3.07070859685828e-16>
DoubleData <1.484057988118911, 4.096346991714267e-16>
DoubleData <1.487655094906455, 7.844371739461077e-17>
DoubleData <1.490966341082659, 6.221434760020122e-17>
DoubleData <1.494024435525118, 8.134142446723606e-16>
DoubleData <1.496857289136956, 6.830938909876451e-16>
DoubleData <1.499488862009605, 8.873091857396741e-16>
DoubleData <1.501939837493851, 7.267528106158521e-16>
DoubleData <1.504228163019072, 4.532269913842895e-16>
DoubleData <1.506369487369343, 5.605999046418969e-16>
DoubleData <1.508377516798939, 2.154370814741562e-16>
DoubleData <1.512040504079174, 1.402618800554007e-16>
DoubleData <1.515297821549179, 5.368157936821226e-16>
DoubleData <1.518213265183954, 7.375391359310956e-16>
DoubleData <1.520837931072953, 6.825613612540716e-16>
DoubleData <1.523213223517913, 4.501543596256141e-16>
DoubleData <1.52537304737332, 2.482983385700394e-17>
DoubleData <1.527345431403365, 7.934779986221159e-16>
DoubleData <1.529153747696308, 6.760940733854243e-16>
DoubleData <1.530817639671606, 3.549557335150754e-16>
DoubleData <1.532353736773708, 2.956322283362237e-16>
DoubleData <1.533776210920966, 9.377354806572844e-17>
DoubleData <1.535097214115572, 7.767503702335957e-16>
DoubleData <1.536327225795388, 8.864446875291863e-16>
DoubleData <1.537475330916649, 8.116858600761242e-17>
DoubleData <1.538549444359642, 5.614707476320801e-16>
DoubleData <1.539556493364628, 8.22229665146797e-16>
DoubleData <1.570796326794897, 6.123233995736766e-17>


The source is embedded in the dll but i can post it here later if the tests are ok. It shouldn´t be hard to port to masm, if someone is interested to work further with the function.

As the original version it can only work with 1 calculated value at once. (Double). But, i presume it could be optimized to handle 4 Floats at once to gain even more speed.
Title: Re: Atan2 SSE2
Post by: guga on February 08, 2022, 11:57:37 PM
Btw, to convert it to degree you can also use this:



[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]
[Float_Zero: R$ 0]

Proc SSE_Atan2Double:
    Arguments @pY, @pX, @ConvDegree
    Uses eax, ebx, ecx

    mov ebx D@pY
    mov ecx D@pX

    movupd XMM0 X$ebx
    movupd XMM1 X$ecx
    ;call 'ucrtbase.__libm_sse2_atan2' <---original version inside ucrtbase.dll
    call Sse2_atan2 <---mine tests.

    .If D@ConvDegree = &TRUE ; <--- Flag to convert to degree (True/False)
        movupd XMM1 X$Float_AtanPiFactor
        mulpd XMM0 XMM1
        xorpd xmm2 xmm2
        SSE_D_If xmm0 < xmm2 ; A simple macro to handle sse2 comparisons. It is simply: COMISD XMM0 XMM2 | JNB D0> So, it is comisd followed by the corresponding jmp to D0 (Our Else_IF macro)
            addpd xmm0 X$Float360
        SSE_D_Else_If xmm0 > X$Float360 ; same as above:     JMP D1> | D0:COMISD XMM0 X$FLOAT360 | JNA D0>
            subpd xmm0 X$Float360
        SSE_D_End_If ; The ending of the jumps D0: D1:
    .End_If

    mov eax &TRUE

EndP


Example of usage:

[TestingAtan2Data1: R$ 0.5317094316614787480759158718400589803054643151426570508592559650]
[TestingAtan2Data2: R$ 1]

call SSE_Atan2Double TestingAtan2Data1, TestingAtan2Data2, &TRUE

It will return in ST0 = 28º
Title: Re: Atan2 SSE2
Post by: jj2007 on February 09, 2022, 01:07:26 AM
Quote from: guga on February 08, 2022, 11:10:30 PM
Can someone test the timmings for this please ?

It's a DLL, Gustavo... how are we supposed to test it?
Title: Re: Atan2 SSE2
Post by: guga on February 09, 2022, 05:13:51 AM
The same way as you did for the FastMath. Didn´t you used a external dll to call the function ? The benchmark apps can handle external dlls, right ?

something like

importlib test.lib

invoke Sse2_atan2

But, I can try disassemble it and create a source compatible to masm, if needed.
Title: Re: Atan2 SSE2
Post by: guga on February 09, 2022, 05:25:15 AM
Please, see if this helps (It´s the disassembled source)
Title: Re: Atan2 SSE2
Post by: jj2007 on February 09, 2022, 06:04:39 AM
Quote from: guga on February 09, 2022, 05:13:51 AM
The same way as you did for the FastMath. Didn´t you used a external dll to call the function ? The benchmark apps can handle external dlls, right ?

something like

importlib test.lib

invoke Sse2_atan2

It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
  Dll "GugaAtan2"
  Declare Sse2_atan2, 2


fails miserably saying "can't find module Sse2_atan2" :sad:
Title: Re: Atan2 SSE2
Post by: HSE on February 09, 2022, 07:23:02 AM
Working here! (using source code)
No timing yet

erased a minor detail :biggrin:
Title: Re: Atan2 SSE2
Post by: guga on February 09, 2022, 07:30:32 AM
Quote from: jj2007 on February 09, 2022, 06:04:39 AM
Quote from: guga on February 09, 2022, 05:13:51 AM
The same way as you did for the FastMath. Didn´t you used a external dll to call the function ? The benchmark apps can handle external dlls, right ?

something like

importlib test.lib

invoke Sse2_atan2

It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
  Dll "GugaAtan2"
  Declare Sse2_atan2, 2


fails miserably saying "can't find module Sse2_atan2" :sad:

That´s weird. The exported function i named as "Sse2_atan2". Are you sure you are importing test.dll ? Because the name of the dll is not "GugaAtan2"

Shouldn´t it be ?

Dll "test"
Declare Sse2_atan2, 1

The ordinal value of the function inside the dll is 1 and not 2
Title: Re: Atan2 SSE2
Post by: guga on February 09, 2022, 07:32:27 AM
Quote from: HSE on February 09, 2022, 07:23:02 AM
Working here!
No timing yet

erased a minor detail :biggrin:

Tks HSE  :thumbsup: :thumbsup: :thumbsup:
Title: Re: Atan2 SSE2
Post by: guga on February 09, 2022, 08:09:41 AM
Quote from: jj2007 on February 09, 2022, 06:04:39 AM
Quote from: guga on February 09, 2022, 05:13:51 AM
The same way as you did for the FastMath. Didn´t you used a external dll to call the function ? The benchmark apps can handle external dlls, right ?

something like

importlib test.lib

invoke Sse2_atan2

It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
  Dll "GugaAtan2"
  Declare Sse2_atan2, 2


fails miserably saying "can't find module Sse2_atan2" :sad:

Sorry, i forgot to answer this

"Sse2_atan2(double X, double Y)?"

Yes, the contents of xmm1 and xmm0 are doubles.(The same as in the original library - ucrtbase.dll). The function does not use any parameter, but you must feed  xmm0 and xmm1 registers with double values.

Internally i rewrote the function as this:


; The table was generated as CreateTangetTable
; I rewrote/reorganized the whole function to avoid heavy usage of spaghetti code as in the original version inside ucrtbase.dll

Proc Sse2_atan2::

    movlpd X$SSEAtanTmpVal1 xmm0 ; <-----Temporary global variable to store the data to be used in the internal functions
    movlpd X$SSEAtanTmpVal2 xmm1 ; <-----Temporary global variable to store the data to be used in the internal functions

    pextrw eax xmm0 3 | and eax 07FF0 | sub eax 03870
    pextrw ecx xmm1 3 | and ecx 07FF0 | sub ecx 03870

    .If_And eax <= 0F00, ecx <= 0F00
        call Sse2_atan2Internal4
    .Else
        pextrw ecx xmm0 3 | and ecx 07FF0
        pextrw eax xmm1 3 | and eax 07FF0
        If ecx = 07FF0
            call Sse2_atan2Internal1
        Else_If eax = 07FF0
            call Sse2_atan2Internal2
        Else
            call Sse2_atan2Internal5
        End_If
    .End_If

EndP
Title: Re: Atan2 SSE2
Post by: jj2007 on February 09, 2022, 08:10:55 AM
Quote from: guga on February 09, 2022, 07:30:32 AMDll "test"
Declare Sse2_atan2, 1

The ordinal value of the function inside the dll is 1 and not 2

I changed the name of the DLL to avoid ambiguities. Declare (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1017) expects the number of args, but I found out debugging your DLL that you pass the two arguments in xmm0 and xmm1 (which is pretty unorthodox).

So, here are results for Sse2_atan2(1.0, 2.0) - as compared to ucrtbase atan2:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6472    cycles for 100 * Sse2_atan2
17801   cycles for 100 * ucrtbase atan2

6443    cycles for 100 * Sse2_atan2
17799   cycles for 100 * ucrtbase atan2

6424    cycles for 100 * Sse2_atan2
17810   cycles for 100 * ucrtbase atan2

6423    cycles for 100 * Sse2_atan2
17841   cycles for 100 * ucrtbase atan2

123     bytes for Sse2_atan2
127     bytes for ucrtbase atan2

Real8   -56.30993247402027180   Sse2_atan2
Real8   -56.30993247402021495   ucrtbase atan2


The two calls are as follows:

  Dll "GugaAtan2"
  Declare Sse2_atan2, 0   ; 0 arguments because they are passed in xmm0 and xmm1
  Dll "ucrtbase"
  Declare atan2, C:2      ; 2 arguments, C calling convention
...
movlps xmm0, FP8(-33.0)
movlps xmm1, FP8(22.0)
void Sse2_atan2()

void atan2(FP8(-33.0), FP8(22.0))


See TestA and TestB for differences in the way the two algos return their results.
Title: Re: Atan2 SSE2
Post by: guga on February 09, 2022, 08:23:01 AM
Hi JJ. Many tks. It seems pretty fast :azn: :azn: :azn:

I know it is an  unorthodox way, but i just wanted to test the functionality on the exact way as it is in the original dll (ucrtbase). To use regular parameters, i created another function to test like this:



[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]
[Float_Zero: R$ 0]


pY(Input) = Pointer to a Double (Real8) value
pX(input) = Pointer to a Double (Real8) value
ConvDegree - A Flag to convert the output to degree or radian. True =  convert to degree. Radian, otherwise

Proc SSE_AtanEx:
    Arguments @pY, @pX, @ConvDegree
    Uses eax, ebx, ecx, edx

    mov ebx D@pY
    mov ecx D@pX

    movupd XMM0 X$ebx
    movupd XMM1 X$ecx
    call  Sse2_atan2  ;<------ call the main function

    .If D@ConvDegree = &TRUE
        movupd XMM1 X$Float_AtanPiFactor
        mulpd XMM0 XMM1
        movupd XMM1 X$Float360
        xorpd xmm2 xmm2
        SSE_D_If xmm0 < xmm2  ; Simple SSE comparition macro as explained earlier. COMISD XMM0 XMM2 | jnb D0> ......
            addpd xmm0 xmm1
        SSE_D_Else_If xmm0 > xmm1 ; Same as above
            subpd xmm0 xmm1
        SSE_D_End_If ; endpoint of the comparition macros. Those macros works the same as the regular If/Else/Else_If, but focused to SSE registers.
    .End_If

    mov eax &TRUE

EndP


Btw..later both functions can be joined to allow the usage of parameters on a single function, but i wanted to test the speed and accuracy of this routine 1st.
Title: Re: Atan2 SSE2
Post by: guga on February 09, 2022, 08:33:47 AM
Quote from: jj2007 on February 09, 2022, 08:10:55 AM
Quote from: guga on February 09, 2022, 07:30:32 AMDll "test"
Declare Sse2_atan2, 1

The ordinal value of the function inside the dll is 1 and not 2

I changed the name of the DLL to avoid ambiguities. Declare (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1017) expects the number of args, but I found out debugging your DLL that you pass the two arguments in xmm0 and xmm1 (which is pretty unorthodox).

So, here are results for Sse2_atan2(1.0, 2.0) - as compared to ucrtbase atan2:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6405    cycles for 100 * Sse2_atan2
35327   cycles for 100 * ucrtbase atan2

6407    cycles for 100 * Sse2_atan2
35314   cycles for 100 * ucrtbase atan2

6420    cycles for 100 * Sse2_atan2
35373   cycles for 100 * ucrtbase atan2

6426    cycles for 100 * Sse2_atan2
35292   cycles for 100 * ucrtbase atan2

105     bytes for Sse2_atan2
108     bytes for ucrtbase atan2

Real8   0.4636476090008099793   Sse2_atan2
Real8   1.570796326794896558    ucrtbase atan2


Results are different, so I probably got some arguments wrong :sad:

About the results being different, maybe it is reusing xmm0 and xmm1 when the functions are called.

Here it works ok if i only do this:


[TestingAtan2Data1: R$ 1.0]
[TestingAtan2Data2: R$ 2.0]

    movupd XMM0 X$TestingAtan2Data1
    movupd XMM1 X$TestingAtan2Data2
    call 'ucrtbase.__libm_sse2_atan2'


    movupd XMM0 X$TestingAtan2Data1
    movupd XMM1 X$TestingAtan2Data2
    call Sse2_atan2


xmm0 = 4.63647609000806093e-1 (ucrtbase)
xmm0 = 4.63647609000809979e-1 (mine version).

The small difference is due to the rounding error that i removed while i was testing the functionality. If you use the complete tangent table, then it should generate the same result. Since, i´m still testing it, i´m unsure how this whole thing works exactly.

I made some tests and placed the results of the tangent table on excel and it seems to be a sort of rotating table that the function is using. I´m trying to find if there is a direct relation between the table values and the resultant data, to see if the speed can be improved further, simply making pointers to the table and some small calculations. But, i still have no idea exactly how this function really works.
Title: Re: Atan2 SSE2
Post by: HSE on February 09, 2022, 08:34:42 AM
Impressive  :thumbsup:

Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)

6783    cycles for 100 * Sse2_atan2
11995   cycles for 100 * fpatan

6783    cycles for 100 * Sse2_atan2
11999   cycles for 100 * fpatan

6798    cycles for 100 * Sse2_atan2
12005   cycles for 100 * fpatan

6786    cycles for 100 * Sse2_atan2
11989   cycles for 100 * fpatan

29      bytes for Sse2_atan2
20      bytes for fpatan

Real8   0.4636476090008099793   Sse2_atan2
Real8   0.4636476090008060935   fpatan


This include result storage.
Title: Re: Atan2 SSE2
Post by: guga on February 09, 2022, 08:58:21 AM
Tks HSE :azn: :azn: :azn:

Sorry for the syntax. I´m used to RosAsm that´s why i have some difficulties in port the code, but it seems to work as expected with minor differences then the original version. You can rewrite the code as you wish.I believe there´s a way to make it even faster, but it is necessary to understand what exactly the whole function is doing. It uses a table of tangents created on a certain way, displaced on 164 values, with a separation of 16 "indexes". I don´t know why it is using only 164 indexes rather then the 176  = 11 indexes *16 (closer to 180º). In excel there seems to be a sort of relation between the indexes and the resultants value, but i´m not sure yet.

Once the whole function is fully understand then, perhaps optimization could be done, forcing the function to stay in only 1 quadrant (90º) and identifying the changes of sign in other quadrants and simply add or subtract 360º from the result. Don´t know yet.

There are other functions inside ucrtbase that seems really good, but the problems is that the way they were written kills speed due to heavy spaghetti code. On the original libm_sse2_atan2, it contains jumps that go back and forward the entire function, and also some pointers to other frames (other functions) as well. What i did was reorganize the whole code and create the proper sub-functions avoiding the heavy usage of spaghetti code inside.
Title: Re: Atan2 SSE2
Post by: jj2007 on February 09, 2022, 09:22:07 AM
Hi Guga,

It seems you had an earlier version (I edited the post). Here is the latest, with HSE's fpatan addition:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6923    cycles for 100 * Sse2_atan2
18360   cycles for 100 * ucrtbase atan2
11834   cycles for 100 * fpatan

6434    cycles for 100 * Sse2_atan2
18435   cycles for 100 * ucrtbase atan2
11476   cycles for 100 * fpatan

6430    cycles for 100 * Sse2_atan2
17775   cycles for 100 * ucrtbase atan2
11868   cycles for 100 * fpatan

6429    cycles for 100 * Sse2_atan2
17821   cycles for 100 * ucrtbase atan2
11493   cycles for 100 * fpatan

123     bytes for Sse2_atan2
127     bytes for ucrtbase atan2
110     bytes for fpatan

Real8   -56.30993247402027180   Sse2_atan2
Real8   -56.30993247402021495   ucrtbase atan2
Real8   -56.30993247402021495   fpatan
Title: Re: Atan2 SSE2
Post by: guga on February 09, 2022, 09:42:00 AM
Tks a lot, JJ

I´ll take a look. There is a small difference in the results, but it´s due to the rounding error that i removed earlier.

I´m trying to understand the math behind this, and why the function is using things like:
pextrw eax xmm0 3 | and eax 07FF0 | sub eax 03870
...

mov eax 0800 | pinsrw xmm5 eax 3

etc etc.

before it reaches what is really necessary to compute the atan2, i mean, before it do things like divsd xmm4 xmm1 etc.

If i could understand what´s going on before it reaches the SSe division, then maybe we could gain more speed, simply making the division of xmm0 / xmm1 at 1st and then do minor calculations before it reaches the tangent table to later adjust the remainder value.

For me, it´s a bit weird it´s using an error value in the 2nd Real8 on the tangent table, because the values of the 1st and 2nd Real8 are complementary, so, it seems useless calculating all of this feeding error values. What´s also weird is that the original version from ucrtbase and fpatan produces some smaller differences then the values found in wolfram-alpha, for example.
Title: Re: Atan2 SSE2
Post by: guga on February 09, 2022, 09:51:13 AM
Btw...follow the results i´ve got in your last version

:dazzled: :dazzled: :dazzled: :dazzled:

AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

6931    cycles for 100 * Sse2_atan2
69442   cycles for 100 * ucrtbase atan2
14871   cycles for 100 * fpatan

7411    cycles for 100 * Sse2_atan2
69354   cycles for 100 * ucrtbase atan2
14886   cycles for 100 * fpatan

6994    cycles for 100 * Sse2_atan2
68109   cycles for 100 * ucrtbase atan2
14957   cycles for 100 * fpatan

7040    cycles for 100 * Sse2_atan2
67606   cycles for 100 * ucrtbase atan2
14731   cycles for 100 * fpatan

123     bytes for Sse2_atan2
127     bytes for ucrtbase atan2
110     bytes for fpatan

Real8   -56.30993247402027180   Sse2_atan2
Real8   -56.30993247402021495   ucrtbase atan2
Real8   -56.30993247402021495   fpatan

--- ok ---


Now i can try to make a small test on the routine i was doing for marinu´s Fir Algo and FFT Analyser to retrieve the value of phase of FFT Frequencies :)
Title: Re: Atan2 SSE2
Post by: HSE on February 10, 2022, 01:16:53 AM
Quote from: guga on February 09, 2022, 09:42:00 AM
then maybe we could gain more speed

The reason that SSE algorithm is fast than FPU is because have less precision. While FPU use 8-9 degree polinomies, SSE algorithms are using 5-6 degrees (if I remeber well some discussion somewhere in this forum).

If you can accept less precision perhaps You can use a faster algorithm.

Then perhaps You can build a hyperfaster low precision library for previews, and a best precision library for final works.

Title: Re: Atan2 SSE2
Post by: guga on February 10, 2022, 02:09:47 AM
I don´t know if the precision is less accurate. On the original version It seems to produce the same result as in FPU. I finished some minor optimizations both for speed and accuracy. Now it seems to result the same value as the original one, and perhaps a bit more fast.

The function (and the original) uses an approximation that seems to be biased in:
https://www.dsprelated.com/showarticle/1052.php

But, it works better, since it is more accurate then the results of the link. It is using a rotating table of tangents. On my last version i removed several useless SSE checks, including  the ones at the start of the function, replaced them with the regular opcodes counterpart (mov eax etc) and forced the routines to simply point to the proper masks. I´m currently trying to understand how exactly one of the internal function works, "Sse2_atan2Internal4", because this is where the table of tangents are being used. It gets a sort of index to calculate the value, and maybe we can create a similar binary search function as JJ did sometime ago to point to the indexes of the table or something like that.

I´m currently labeling the variables and forcing the function to use the proper parameters rather then xmm0/xmm1 directly and also creating another function to retrieve the resul in Angles too. Once i finish i´ll upload it here to we perform the benchmarks. :smiley: :smiley:
Title: Re: Atan2 SSE2
Post by: jj2007 on February 10, 2022, 04:11:21 AM
Quote from: guga on February 10, 2022, 02:09:47 AM...table of tangents are being used. It gets a sort of index to calculate the value, and maybe we can create a similar binary search function as JJ did sometime ago to point to the indexes of the table or something like that.

Quote from: jj2007 on February 07, 2022, 09:54:42 AMIf your problem is to find atan2(X,Y), and either X or Y is a constant, then FastMath (http://masm32.com/board/index.php?topic=8779.0) would be an option.
Title: Re: Atan2 SSE2
Post by: guga on February 10, 2022, 09:47:29 AM
Tks, JJ.

I couldn´t understand how to test FastMath. Also, isn´t it using the windows dll  (crt or ucrtbase) underneath ?

Btw...here goes a new version. I made some modifications as you suggested and make it work with regular parameters. One thing to consider when assembling this. The global variables Atan2_Mask1 to Atan2_Mask12 must be aligned to a 16 byte-boundary, otherwise the app will crash. This is because they are being used as variables to things like minsd, andpd etc

I added 4 functions (Renamed the previous function . "D" for double for easier comprehension):

SSE_Atan2DEx: Retrieves the atan2 of a value and also can convert the result to Degree or Radian.
Uses 3 parameters:
pY - Pointer to a Real8 value
pX - Pointer to a Real8 value
ConvDegree- A Flag (integer) to convert the result to degrees. TRUE = Degrees. FALSE = Radians.

Return Value: The resultant value is located in xmm0 register


SSE_Atan2D: Retrieves the atan2 of a value.
Uses 2 parameters:
pY - Pointer to a Real8 value
pX - Pointer to a Real8 value

Return Value: The resultant value is located in xmm0 register (In radians)

Atan2: Retrieves the atan2 of a value and also can convert the result to Degree or Radian. Same as SSE_Atan2DEx, but using FPU rather then SSE2
Uses 3 parameters:
pY - Pointer to a Real8 value
pX - Pointer to a Real8 value
ConvDegree- A Flag (integer) to convert the result to degrees. TRUE = Degrees. FALSE = Radians.

Return Value: The resultant value is located in ST0 register

CreateTangetTable This was for my tests only. It generates a table of tangents used in SSE_Atan2DEx function. The function uses no parameters, and the resultant values are stored in MyTangTBl variable
Title: Re: Atan2 SSE2
Post by: guga on February 10, 2022, 11:57:37 PM
Typo:

The SSE_Atan2D uses 2 parameters and not 3 as i wrote in the previous post. Post now edited :smiley:
Title: Re: Atan2 SSE2
Post by: jj2007 on February 11, 2022, 01:23:59 AM
Quote from: guga on February 10, 2022, 09:47:29 AM
I couldn´t understand how to test FastMath. Also, isn´t it using the windows dll  (crt or ucrtbase) underneath ?

No, it doesn't use crt or similar.

Quote from: jj2007 on February 07, 2022, 09:54:42 AMIf your problem is to find atan2(X,Y), and either X or Y is a constant, then FastMath (http://masm32.com/board/index.php?topic=8779.0) would be an option.

But FastMath is limited to one argument functions, that's why I asked what your usage of that function is. Anyway, I attach a new version that includes FastMath with a fixed second argument:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6699    cycles for 100 * Sse2_atan2
18672   cycles for 100 * ucrtbase atan2
11874   cycles for 100 * fpatan
1568    cycles for 100 * FastAtan

6511    cycles for 100 * Sse2_atan2
18258   cycles for 100 * ucrtbase atan2
11638   cycles for 100 * fpatan
1426    cycles for 100 * FastAtan

6581    cycles for 100 * Sse2_atan2
18033   cycles for 100 * ucrtbase atan2
11706   cycles for 100 * fpatan
1477    cycles for 100 * FastAtan

6574    cycles for 100 * Sse2_atan2
18159   cycles for 100 * ucrtbase atan2
11986   cycles for 100 * fpatan
1450    cycles for 100 * FastAtan

123     bytes for Sse2_atan2
127     bytes for ucrtbase atan2
106     bytes for fpatan
269     bytes for FastAtan

Real8   -56.30993247402027180   Sse2_atan2
Real8   -56.30993247402021495   ucrtbase atan2
Real8   -56.30993247402021495   fpatan
Real8   -56.30993247402021495   FastAtan


Here is the FastMath usage, first the creation (once):
  FastMath FastAtan ; define a math function
For_ fct=-40.0 To 10.0 Step 0.5 ; max 10,000 iterations
fld fct ; X
fstp REAL10 ptr [edi]
fld fct
fld FP8(22.0)
fpatan    ; Masm32 algo
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
  FastMath


And here the usage in the loop, see TestD:

  mov ebx, AlgoLoops-1 ; loop e.g. 100x
  align 4
  .Repeat
void FastAtan(-33.0)
dec ebx
.if !Sign?
fstp st
.endif
  .Until Sign?


Timings for the creation of the FastMath table (it could be done many times in an outer loop for the second argument):

39 µs for creating FastMath
28 µs for creating FastMath
23 µs for creating FastMath
26 µs for creating FastMath
23 µs for creating FastMath
25 µs for creating FastMath
23 µs for creating FastMath
25 µs for creating FastMath
23 µs for creating FastMath
23 µs for creating FastMath
Title: Re: Atan2 SSE2
Post by: guga on February 11, 2022, 02:00:17 AM
WOW !

Tks JJ. But, i think you used the old version. The new one (That uses parameters) is the one attached previously (also attached here). The name of the function is SSE_Atan2DEx (Uses 3 parameters to convert to degree) or SSE_Atan2D (Using only 2 parameters. No degree conversion)

I tried to assemble your file with RichMasm, but got this msg

*** Start C:\masm32\MasmBasic\Res\bldallRM.bat ***
*** 32-bit assembly ***


*** Assemble, link and run GugaAtan2 ***

*** Assemble using ml  ***
Assembling: tmp_file.asm
##########################

You CANNOT use the MasmBasic library with
the old ml.exe version 6.14 - use UAsm instead

##########################
tmp_file.asm(13) : error A2008: syntax error : StackWalk
*** MasmBasic version 18.02.2021 ***
*** Warning: SQWORD is unsigned with ml **
\masm32\MasmBasic\MasmBasic.inc(728) : error A2052: forced error
TestMasmVersion(8): Macro Called From
  \masm32\MasmBasic\MasmBasic.inc(728): Include File
*** Assembly error ***



I do have uasm installed in the masm directory "C:\masm32\bin\uasm64.exe". How do i configure richmasm to assemble the file using uasm, so i can try to assemble the file as well ?
Title: Re: Atan2 SSE2
Post by: jj2007 on February 11, 2022, 02:43:58 AM
Quote from: guga on February 11, 2022, 02:00:17 AMI tried to assemble your file with RichMasm, but got this msg

Oops, my fault:
- comment out StackWalk (http://masm32.com/board/index.php?topic=9803.msg107496#msg107496) in line 13 (that's a new macro, not yet included in the current MasmBasic version)
- use OPxT_Assembler at the end of the file (the x disables the option; your ml.exe is the old 6.14)

New version below. Creating the FastMath table for a new second argument costs 12 microseconds on average. You still have not told us how you will use this; since you are interested in speed, I suppose you have an innermost plus some outer loop. Can the outer loop work with a 12 microseconds delay?

12 us on average for creating FastMath
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6417    cycles for 100 * SSE_Atan2D_new
18184   cycles for 100 * ucrtbase atan2
11387   cycles for 100 * fpatan
1438    cycles for 100 * FastAtan

6273    cycles for 100 * SSE_Atan2D_new
18093   cycles for 100 * ucrtbase atan2
11382   cycles for 100 * fpatan
1398    cycles for 100 * FastAtan

6171    cycles for 100 * SSE_Atan2D_new
18259   cycles for 100 * ucrtbase atan2
11396   cycles for 100 * fpatan
1398    cycles for 100 * FastAtan

6378    cycles for 100 * SSE_Atan2D_new
18089   cycles for 100 * ucrtbase atan2
11382   cycles for 100 * fpatan
1398    cycles for 100 * FastAtan

49      bytes for SSE_Atan2D_new
51      bytes for ucrtbase atan2
30      bytes for fpatan
193     bytes for FastAtan

Real8   -56.30993247402021495   SSE_Atan2D_new
Real8   -56.30993247402021495   ucrtbase atan2
Real8   -56.30993247402021495   fpatan
Real8   -56.30993247402021495   FastAtan
Title: Re: Atan2 SSE2
Post by: guga on February 11, 2022, 04:04:10 AM
Hi JJ.
Many Tks. I´ll take a look

(Post edited), i didn´t saw the file you uploaded when i was writing this comment.
Title: Re: Atan2 SSE2
Post by: guga on February 11, 2022, 04:41:31 AM
My results with the new version. A bit faster and accurated :azn: :azn:


31 us on average for creating FastMath
AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

5123    cycles for 100 * SSE_Atan2D_new
67818   cycles for 100 * ucrtbase atan2
14832   cycles for 100 * fpatan
1479    cycles for 100 * FastAtan

5166    cycles for 100 * SSE_Atan2D_new
67255   cycles for 100 * ucrtbase atan2
14665   cycles for 100 * fpatan
1523    cycles for 100 * FastAtan

5029    cycles for 100 * SSE_Atan2D_new
68465   cycles for 100 * ucrtbase atan2
15228   cycles for 100 * fpatan
1766    cycles for 100 * FastAtan

5076    cycles for 100 * SSE_Atan2D_new
67520   cycles for 100 * ucrtbase atan2
14604   cycles for 100 * fpatan
1568    cycles for 100 * FastAtan

49      bytes for SSE_Atan2D_new
51      bytes for ucrtbase atan2
30      bytes for fpatan
193     bytes for FastAtan

Real8   -56.30993247402021495   SSE_Atan2D_new
Real8   -56.30993247402021495   ucrtbase atan2
Real8   -56.30993247402021495   fpatan
Real8   -56.30993247402021495   FastAtan



Btw, i´m  impressed about the results of FastAtan. I analysed it in the disassembler and got a bit clueless how it is showing the correct result since it uses only 1 parameter (value: 33) and not the other one. The FastMath library is being applied before the computations, right ? Because at


.text:00401201 UsedInFastAtan:                         ; CODE XREF: MainRoutine+418↓j
.text:00401201                 fld     tbyte_408068 <- 0.22 =  22/100 (Our 1st parameter Y) ?
.text:00401207                 fld     tbyte_409992
.text:0040120D                 faddp   st(1), st
.text:0040120F                 fstp    tbyte_409992
.text:00401215                 push    ecx
.text:00401216                 fxsave  dword_409B50
.text:0040121D                 push    0               ; dwMilliseconds
.text:0040121F                 call    Sleep
.text:00401224                 push    offset PerformanceCount ; lpPerformanceCount
.text:00401229                 call    QueryPerformanceCounter
.text:0040122E                 fxrstor dword_409B50
.text:00401235                 pop     ecx
.text:00401236                 push    ecx
.text:00401237                 push    edi
.text:00401238                 push    0C3500h         ; dwBytes
.text:0040123D                 push    4               ; dwFlags
.text:0040123F                 push    hHeap           ; hHeap
.text:00401245                 call    HeapAlloc
.text:0040124A                 mov     edi, eax
.text:0040124C                 mov     dword_4098E0, edi
.text:00401252                 push    edi
.text:00401253                 add     edi, 61A80h
.text:00401259                 push    edi
.text:0040125A                 ffree   st(7)
.text:0040125C                 fld     tbyte_40807C
.text:00401262                 ffree   st(7)
.text:00401264                 fld     st
.text:00401266                 fstp    tbyte_409988
.text:0040126C                 fld     tbyte_408072
.text:00401272                 fsubrp  st(1), st
.text:00401274                 fld     tbyte_408086
.text:0040127A                 fdivp   st(1), st
.text:0040127C                 mov     edx, offset dword_4099A4
.text:00401281                 fmul    dbl_408090
.text:00401287                 fisttp  dword ptr [edx]
.text:00401289                 fld     tbyte_408086
.text:0040128F                 fimul   dword ptr [edx]
.text:00401291                 fld     tbyte_409988
.text:00401297                 faddp   st(1), st
.text:00401299                 fstp    tbyte_4099FA
.text:0040129F                 cmp     dword_4099A4, 0
.text:004012A6                 js      short loc_4012EA
.text:004012A8
.text:004012A8 loc_4012A8:                             ; CODE XREF: MainRoutine+2C4↓j
.text:004012A8                 fld     tbyte_409988
.text:004012AE                 fstp    tbyte ptr [edi]
.text:004012B0                 fld     tbyte_409988
.text:004012B6                 fld     tbyte_409992
.text:004012BC                 fpatan <---------------Prwcalculated


if i understood correctly, all the computation also uses fpatan opcode, but it is precalculated before the calculations of the timming. So, for the resultant timming be more acccurated, wouldn´t be necessary to add to the final timming of "void FastAtan(-33.0)", the timming it took to precalculate the data. I.e: the fpatan timming as well ?
Title: Re: Atan2 SSE2
Post by: guga on February 11, 2022, 05:59:40 AM
About the usage. A faster Atan2 is needed to compute more data related to a FFT from audio samples. I'm testing Marinus FirAnalyser and FFTAnalyser in order to create a app that can export ta audio filtered with Fir algorithm or other sort of algorithm that uses FFT too.

http://masm32.com/board/index.php?topic=4231.135
http://masm32.com/board/index.php?topic=3568.msg39209#msg39209
http://masm32.com/board/index.php?topic=3665.15

I´m currently trying to finish a routine to save/export an audio filtered with Marinu´s Fir Algorithm. Later i´ll try to do a similar spectrogram as he did using FFT routines, but instead only calculating the Real or Imaginary frequencies, i wanted to export everything related to an audio file/sample, such as amplitude, phase (that uses atan2 function), pitch identification, Sound Intensity, Sound Power level and finally estimate the distance of the sound and the microphone that recorded it and created the sample (estimating the distance of each frequency can be handy to isolate them or identify backgrounds etc etc).

So, input = audio sample
Output = A structure containing everything related to that sample (or sequence of samples).

Doing that, with all information related to a certain audio file, we can build better tools to identify pitch, noise, isolating vocals etc etc using techniques similar to facebook demucs, or facebook wave2text etc etc. Or also creating a tool that can act as a photoshop, but for audio. example, once we know the frequencies the generated data is placed on a Spectrogram (such as the one existent in Isotope RX using Mel Scale), then we can simply work on the audio file on the same way we do for images. We can identify vocals using only the generated image rather then the heavy computations that involves audio etc


His FFT analyzer has a nice spectrogram but i wanted something like this

(https://imgkub.com/images/2022/02/11/Untitled.th.jpg) (https://imgkub.com/image/Jz0Pp)

Isotope is an extraordinary audio editing tool, but lack some features, such as phase identification, and sound separation (such as facebook´s one - demucs) etc etc

Title: Re: Atan2 SSE2
Post by: jj2007 on February 11, 2022, 06:26:54 AM
Quote from: guga on February 11, 2022, 04:41:31 AMif i understood correctly, all the computation also uses fpatan opcode, but it is precalculated before the calculations of the timming. So, for the resultant timming be more acccurated, wouldn´t be necessary to add to the final timming of "void FastAtan(-33.0)", the timming it took to precalculate the data. I.e: the fpatan timming as well ?

"31 µs on average for creating FastMath"

That's why I asked (repeatedly) what your requirements are. If there are two nested loops, with X as the first and Y as the second counter, then the outer loop could re-create the FastMath table (at 12µs on my, 30µs on yours), while the inner loop just uses it.
Title: Re: Atan2 SSE2
Post by: guga on February 15, 2022, 10:27:40 AM
Hi Guys

I succeeded to find the mathematical equation behind the Atan2 function. It uses a variation of a Taylor Series on the 4th degree, and seems to use some fractions derived from a polynomial calculation (Chebyshev ?)

The polynomial used is this:

0.9980475428149416 * (Y^9)/9 - 0.9999987497055115 * (Y^7)/7 + 0.9999999997101690 * (Y^5)/5 - 0.9999999999999822 * (Y^3)/3 + Y

Where Y = A value resultant from the division of xmm0/xmm4

The divisor is calculated as:
-k + (1 + k^2)/(k + z)

where
k = X/Y = Inputed XValue / Inputed YValue
z = The position on the table that decrements each 16 values. This "position" is, in fact a arctangent of a value starting from 32 to 0.029296875000 (From bottom to top on the table)

I´m making some tests using JJ´s technique he did sometime ago on the binary search algo.

I´ll later try to make further tests before implement a newer routine to check for speed. So, far, i succeeded to remove several repeated usage of xmm registers. I´ll try to limit the number of registers, once i suceed to decode this Taylor series and see if it will work on all situations.

Title: Re: Atan2 SSE2
Post by: guga on February 20, 2022, 01:37:11 AM
Hi Guys

I succeeded to simplify the math , but i have no idea, why the original version is faster then the one i´m testing. Since it is using less registers and i removed some opcodes, it should be faster. But, for some reason, the original one, is achieving around 5000 clocks only, while the other version i´m testing has only 6500 clocks, even using less registers. Both results are ok, but, what i´m doing wrong ? Why this new version is slower even using less registers and less opcodes ? The only difference in the new version is on a routine, i called Sse2_atan2_ComputePolynomial. Is there a cache issue in SSE2 registers that i´m, not aware ?

Btw: I also tested with JJ´s Binary search routine, but it was unsuccessful. The speed results is the same as he version that does not use the loop.

The original version of thee routine is this:


Proc Sse2_atan2_ComputePolynomial:
    Arguments @pY, @pX
    Local @StartBlockIndexX, @StartBlockIndexY, @MaskedIndex
    Uses ecx

    mov eax D@pY | movupd xmm0 X$eax; TestingAtan2Data1
    mov eax D@pX | movupd xmm1 X$eax; TestingAtan2Data2

    unpcklpd xmm0 xmm1 ; xmm1 = SSEAtanTmpVal2
    movmskpd eax xmm0 | add eax eax | mov D@MaskedIndex eax

    xorpd xmm3 xmm3
    movupd xmm5 X$Atan2_DivPower895 ; 1/2^895 same as: xorpd xmm5 xmm5 | mov eax 0800 | pinsrw xmm5 eax 3
    paddw xmm5 xmm1
    psrlq xmm5 29
    rcpss xmm3 xmm5
    psllq xmm3 29

    paddw xmm3 X$Atan2_DivPower127 ; xmm3+(1/(2^127)) Warning. The data at this variable must be aligned to 16 bytes
    mulsd xmm3 xmm0 ; Multiply this result with the unpacked data from SSEAtanTmpVal2

    paddd xmm3 X$Atan2_Mask4
    andpd xmm3 X$Atan2_Mask3 ;  get the value of our table = tan(ValueUnknown) = 1.125 etc

    movsd xmm5 xmm3 ; xmm5 used to get the index for out tangent table
    mov eax 08000 | pextrw eax xmm3 3 | and eax 07FF0 | sub eax 03F9E | mov D@StartBlockIndexX eax | add eax 942 | mov D@StartBlockIndexY eax


    minsd xmm3 X$Atan2_Float_32

    SSE_ABS_REAL8 xmm0 ; Convert to positive. This macro is the same as below

    ;psllq xmm0 1
    ;psrlq xmm0 1    ; sign bit of pY was removed (IF any)

    cmplesd xmm5 X$Atan2_Float_32

    SSE_ABS_REAL8 xmm1 ; Convert to positive. This macro is the same as below

;    psllq xmm1 1    ; sign bit of pX was removed (IF any)
;    psrlq xmm1 1 ; contains the value of pX TESTINGATAN2DATA2

    movsd xmm6 xmm1 ; xmm6 = pX
    movsd xmm7 xmm1 ; xmm7 = pX

    movsd xmm4 xmm0 | mulsd xmm4 xmm3 | andpd xmm1 xmm5 | addsd xmm4 xmm1

    mov ecx 0 | pinsrw xmm6 ecx 0 | subsd xmm7 xmm6
    mulsd xmm6 xmm3 | mulsd xmm7 xmm3 | andpd xmm0 xmm5 | subsd xmm0 xmm6
    subsd xmm0 xmm7

    ; TangentTbl2
    .If D@StartBlockIndexX <= 1121 ; used xmm4, xmm0, xmm5, xmm3

        ; Get tangent index and put in ecx
        mov eax D@StartBlockIndexX | pextrw eax xmm5 0 | not eax | and eax 1
        mov ecx D@StartBlockIndexY | pextrw ecx xmm3 3 | sub ecx 03F9E | add ecx eax | add ecx ecx
        movupd xmm5 X$TangentTbl+ecx*8

        mov eax D@MaskedIndex

        movupd xmm6 X$Atan2_P_TBL+eax*8
        movupd xmm1 X$Atan2_SGN_TBL+eax*8
        xorpd xmm5 xmm1 | addpd xmm5 xmm6

        movsd xmm6 xmm5
        unpckhpd xmm5 xmm5

        divsd xmm0 xmm4 ; we are dividing pX/pY if abs(px) < abs(py)
        ; And do the polynomial routine. A Taylor series of 4th degree of the tangent math
        xorpd xmm1 xmm0
        movsd xmm4 xmm1
        mulsd xmm0 xmm0 ; Y = Y^2

        movlpd xmm2 X$Atan2_FactorA9 ; FactorA9
        mulsd xmm2 xmm0 ; A9*Y = FactorA9*Y

        movlpd xmm3 X$Atan2_Limit3
        addsd xmm3 xmm0 ; Atan2_Limit3+Y
        addsd xmm1 xmm6
        subsd xmm6 xmm1
        addsd xmm6 xmm4
        addsd xmm2 X$Atan2_Limit2 ; FactorA9*Y+Atan2_Limit2
        mulsd xmm3 xmm0 ; (Atan2_Limit3+Y)*Y
        mulsd xmm4 xmm0
        addsd xmm6 xmm5
        mulsd xmm2 xmm4
        addsd xmm3 X$Atan2_Limit4 ;  Atan2_Limit4 + (Atan2_Limit3+Y)*Y
        mulsd xmm2 xmm3
        addsd xmm2 xmm6
        addsd xmm1 xmm2 ; xmm1 = -HalfPi, HalfPi or 0
        movupd xmm0 xmm1

    .Else_If D@StartBlockIndexY <= 942;03AE ; Simple_Polynomial used xmm1, xmm3, xmm6, xmm0

        mov eax D@pY | movupd xmm2 X$eax | SSE_ABS_REAL8 xmm2

        movupd xmm4 X$Atan2_Float_One ; same as: xorpd xmm4 xmm4 | mov ecx 03FF0 | pinsrw xmm4 ecx 3
        divsd xmm4 xmm1

        mov eax D@MaskedIndex
        movupd xmm6 X$Atan2_SGN_TBL+eax*8
        unpcklpd xmm3 xmm3
        xorpd xmm0 xmm6
        xorpd xmm2 xmm6
        xorpd xmm3 xmm6
        movupd xmm7 X$Atan2_P_TBL2+eax*8
        movlpd xmm1 X$Atan2_FactorA9
        movlpd xmm5 X$Atan2_Limit3
        andpd xmm3 X$Atan2_SELECT_B+eax*8
        mulsd xmm2 xmm4
        mulsd xmm0 xmm4
        movsd xmm6 xmm2
        mulsd xmm2 xmm2
        mulsd xmm1 xmm2
        addsd xmm5 xmm2
        mulsd xmm6 xmm2
        addsd xmm1 X$Atan2_Limit2
        mulsd xmm5 xmm2
        addsd xmm7 xmm0
        addpd xmm7 xmm3
        mulsd xmm1 xmm6
        addsd xmm5 X$Atan2_Limit4
        mulsd xmm5 xmm1
        addsd xmm5 xmm7
        unpckhpd xmm7 xmm7
        addsd xmm5 xmm7
        movupd xmm0 xmm5

    .Else ; Big_or_Small
        call Sse2_atan2_Big_or_Small
    .End_If

EndP



Newer version is:


Proc Sse2_atan2_ComputePolynomial:
    Arguments @pY, @pX
    Local @StartBlockIndexX, @StartBlockIndexY, @PosInTable
    Uses esi, ecx

    movupd xmm1 X$Atan2_DivPower895 ; 1/2^895 same as: xorpd xmm5 xmm5 | mov eax 0800 | pinsrw xmm5 eax 3
    mov eax D@pX | movupd xmm0 X$eax
    paddw xmm1 xmm0
    psrlq xmm1 29
    xorpd xmm2 xmm2
    rcpss xmm2 xmm1
    psllq xmm2 29

    paddw xmm2 X$Atan2_DivPower127 ; xmm3+(1/(2^127)) Warning. The data at this variable must be aligned to 16 bytes
    mov eax D@pY | movupd xmm0 X$eax
    mulsd xmm2 xmm0 ; Multiply this result with the unpacked data from SSEAtanTmpVal2

    paddd xmm2 X$Atan2_Mask4
    andpd xmm2 X$Atan2_Mask3 ;  get the value of our table = tan(ValueUnknown) = 1.125 etc

    movsd xmm1 xmm2 ; xmm5 used to get the index for out tangent table
    pextrw eax xmm2 3 | and eax 07FF0 | sub eax 16286 | mov D@StartBlockIndexX eax | add eax 942 | mov D@StartBlockIndexY eax

    minsd xmm2 X$Atan2_Float_32

    cmplesd xmm1 X$Atan2_Float_32

    .If D@StartBlockIndexX <= 1121
        ; Get tangent index and put in ecx

        pextrw eax xmm1 0 | not eax | and eax 1
        pextrw ecx xmm2 3 | sub ecx 16286 | add ecx eax | add ecx ecx | mov D@PosInTable ecx

        ; we already have Y in xmm0, so, let´s put X on xmm1
        mov eax D@pX | movupd xmm1 X$eax

        ; get our masked index
        unpcklpd xmm0 xmm1 | movmskpd ecx xmm0 | add ecx ecx; | mov D@MaskedIndex eax

        SSE_ABS_REAL8 xmm0 ; Turn Y in Positive
        SSE_ABS_REAL8 xmm1 ; Turn X in Positive
        ; xmm2 already have our tangent pos index
        pshufd xmm2 xmm2 {SHUFFLE 1,0,1,0} ; we are simply coying from LowQuadrord to Hi Quaddword
        mulsd xmm2 xmm0 ; Y*z
        addsd xmm2 xmm1 ; X+Y*z
        pshufd xmm2 xmm2 SSE_SWAP_QWORDS ; swap (X + Y z)in xmm2, to we reuse it
        mulsd xmm1 xmm2; X*z
        subsd xmm0 xmm1; Y - X z
        movhlps xmm1 xmm2 ; copy (X + Y z) in the hiqword of xmm2 to low Qword in xmm1
        divsd xmm0 xmm1 ; finally get our divisor

        ; ----------------------- Start Calculating Polynomial - Taylor Series of Atan
        pshufd XMM0 XMM0 {SHUFFLE 1,0,1,0} ; we are simply coying from LowQuadrord to Hi Quaddword =    PSHUFD XMM0 XMM0 044h
        mulsd xmm0 xmm0 ; the lowquadword will be a power of our divisor Y^4

        movsd xmm1 xmm0 ; Copy the double Y to xmm1
        ; 1St Step. Calculate C = (0.1108941714238824 Y^2 - 0.1428569642436445)*Y^2
        mulsd xmm1 X$Atan2_FactorA9
        addsd xmm1 X$Atan2PolynomialFactor1A
        mulsd xmm1 xmm0

        ; 2nd step. Calculate B = Y^2*(C+0.1999999999420338)
        addsd xmm1 X$Atan2PolynomialFactor2A
        mulsd xmm1 xmm0

        ; 3rd step. Calculate A = Y^2*(B - 0.3333333333333274)
        addsd xmm1 X$Atan2PolynomialFactor3A
        mulsd xmm1 xmm0

        ; 4th step. Calculate Y*(A+1)
        pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; swap Y <-- Y^2 in xmm0 =  PSHUFD XMM0 XMM0 78 (decimal)
        addsd xmm1 X$Atan2_Float_One
        mulsd xmm0 xmm1 ; result in xmm0

        mov eax D@PosInTable
        addsd xmm0 X$TangentTbl+eax*8+TangenTbl.MainValueDis
        addsd xmm0 X$TangentTbl+eax*8+TangenTbl.RemainderValueDis

        ; Now we can safelly apply our mask TestingAtan2Data1
        xorpd xmm0 X$Atan2_SGN_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8+8


    .Else_If D@StartBlockIndexY <= 942

        movsd xmm2 X$Atan2ClosetoZeroIndex ;esi+TangenTbl.TangPosDis = 0.029296875 (Fixed value when the value is closer to zero or near Pi)
        ; got our true pointer

        ; No need for index here. We need only to divide x/y
        ; we already have Y in xmm0, so, let´s put X on xmm1
        movups xmm1 X$Atan2_Float_One
        mov eax D@pX | movupd xmm1 X$eax ; Get X and copy it to xmm0
        SSE_ABS_REAL8 xmm0 ; Turn Y in Positive
        SSE_ABS_REAL8 xmm1 ; Turn X in Positive
        divsd xmm0 xmm1

        ; get our masked index
        unpcklpd xmm0 xmm1 | movmskpd ecx xmm0 | add ecx ecx; | mov D@MaskedIndex eax

        ; ----------------------- Start Calculating Polynomial - Taylor Series of Atan
        pshufd XMM0 XMM0 {SHUFFLE 1,0,1,0} ; we are simply coying from LowQuadrord to Hi Quaddword
        mulsd xmm0 xmm0 ; the lowquadword will be a power of our divisor Y^4
        ; -9.05387052965142591e-4   8.19725715676905915381763684962193281e-7

        movsd xmm1 xmm0 ; Copy the double Y to xmm1
        ; 1St Step. Calculate C = (0.1108941714238824 Y^2 - 0.1428569642436445)*Y^2
        mulsd xmm1 X$Atan2_FactorA9 | addsd xmm1 X$Atan2PolynomialFactor1A | mulsd xmm1 xmm0

        ; 2nd step. Calculate B = Y^2*(C+0.1999999999420338)
        addsd xmm1 X$Atan2PolynomialFactor2A | mulsd xmm1 xmm0

        ; 3rd step. Calculate A = Y^2*(B - 0.3333333333333274)
        addsd xmm1 X$Atan2PolynomialFactor3A | mulsd xmm1 xmm0

        ; 4th step. Calculate Y*(A+1)
        pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; swap Y <-- Y^2 in xmm0
        addsd xmm1 X$Atan2_Float_One | mulsd xmm0 xmm1 ; result in xmm0

        xorpd xmm0 X$Atan2_SGN_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8+8


    .Else ; Big_or_Small
        call Sse2_atan2_Big_or_Small
    .End_If

EndP

Title: Re: Atan2 SSE2
Post by: jj2007 on February 20, 2022, 10:57:27 AM
Hi Guga,

I'm a bit lost - what do you mean with "JJ´s Binary search routine"?
Title: Re: Atan2 SSE2
Post by: guga on February 20, 2022, 06:08:31 PM
Quote from: jj2007 on February 20, 2022, 10:57:27 AM
Hi Guga,

I'm a bit lost - what do you mean with "JJ´s Binary search routine"?

Hi JJ

This one  :azn: :azn:

http://masm32.com/board/index.php?topic=7659.msg83723#msg83723

You helped some  time ago on a image routine i was testing.
Title: Re: Atan2 SSE2
Post by: jj2007 on February 20, 2022, 09:00:15 PM
Quote from: guga on February 20, 2022, 06:08:31 PM
Quote from: jj2007 on February 20, 2022, 10:57:27 AM
Hi Guga,

I'm a bit lost - what do you mean with "JJ´s Binary search routine"?

Hi JJ

This one  :azn: :azn:

http://masm32.com/board/index.php?topic=7659.msg83723#msg83723

You helped some  time ago on a image routine i was testing.

Yes, now I remember. It's called ArrayIndex (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1179) now.
Title: Re: Atan2 SSE2
Post by: HSE on February 21, 2022, 12:05:14 AM
Hi Guga!

Quote from: guga on February 20, 2022, 01:37:11 AM
Since it is using less registers and i removed some opcodes, it should be faster.

That is a bad assumption!

See instruccion timing (http://masm32.com/board/index.php?topic=9666.msg106081#msg106081).

Regards, HSE.
Title: Re: Atan2 SSE2
Post by: guga on February 23, 2022, 01:00:03 PM
Quote from: HSE on February 21, 2022, 12:05:14 AM
Hi Guga!

Quote from: guga on February 20, 2022, 01:37:11 AM
Since it is using less registers and i removed some opcodes, it should be faster.

That is a bad assumption!

See instruccion timing (http://masm32.com/board/index.php?topic=9666.msg106081#msg106081).

Regards, HSE.

Indeed. You are right. Tks, HSE :thumbsup: I´ll need to stay with the original version instead. Or...another solution is use the LolRemez polynomials to speed it up a bit, but i´m not sure about accuracy yet. I´ll give a try converting aa app made in FreeBasic that do this LolRemez polynomials to be studied. The only problem is that i´ll also need to try to create a dll for FreeBasic to this thing works properly (Ouch.). If i see that the FreeBasic converstion will take too many time, i´ll give up and try to create a Lolremez app by scratch to test the accuracy for the algorithm. Btw..a LolRemez version for masm (or RosAsm, Fasm, nasm etc) is utterly needed for we test other ways to create faster routines that uses complex math.
Title: Re: Atan2 SSE2
Post by: jj2007 on February 23, 2022, 10:53:37 PM
Quote from: guga on February 23, 2022, 01:00:03 PMcreate faster routines that uses complex math.

If you give me examples what you need, I can try with FastMath() (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1191).
Title: Re: Atan2 SSE2
Post by: guga on March 02, 2022, 08:23:25 AM
Hu JJ

I don´t remember right now any working example. I´ll take a look later, ok ? The goal is create a tool that allow us to do complex calculations on a faster way using this lolremez technique (or other) that can result on approximation values, like the ones we discussed some time ago with Siekmansk at: http://masm32.com/board/index.php?topic=8734.msg95511#msg95511

So, find the polynomials values for Exp(x) = Exp2(x / Logn(2.0)) etc or more complex maths like:

(x^2/log(y)) + sin(x)*atan2(y) etc etc

With lolremez we can produce things like:


lolremez -d 10 -r 0:1 "2**x" "2**x"

// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
double f(double x)
{
    double u = 9.9522144894077596e-9;
    u = u * x + 9.4609455637620191e-8;
    u = u * x + 1.331271998884894e-6;
    u = u * x + 1.5244659656760603e-5;
    u = u * x + 1.5403957490414841e-4;
    u = u * x + 1.3333543730974749e-3;
    u = u * x + 9.6181294091749148e-3;
    u = u * x + 5.5504108628244985e-2;
    u = u * x + 2.402265069613608e-1;
    u = u * x + 6.9314718055989127e-1;
    return u * x + 1.0000000000000002;
}

.const
Chebylog2E      real4 4 dup (1.44269504089) ; 1/Logn(2.0)

Cheby10Exp0  real4 4 dup (9.9522144894077596e-9)
Cheby10Exp1  real4 4 dup (9.4609455637620191e-8)
Cheby10Exp2  real4 4 dup (1.331271998884894e-6)
Cheby10Exp3  real4 4 dup (1.5244659656760603e-5)
Cheby10Exp4  real4 4 dup (1.5403957490414841e-4)
Cheby10Exp5  real4 4 dup (1.3333543730974749e-3)
Cheby10Exp6  real4 4 dup (9.6181294091749148e-3)
Cheby10Exp7  real4 4 dup (5.5504108628244985e-2)
Cheby10Exp8  real4 4 dup (2.402265069613608e-1)
Cheby10Exp9  real4 4 dup (6.9314718055989127e-1)
Cheby10Exp10  real4 4 dup (1.0000000000000002)


.code
align 16
SSE2_Exp_10:  ; in: xmm0, out: xmm1
    movaps      xmm2,oword ptr Chebylog2E
    mulps       xmm2,xmm0
    psrld       xmm0,31
    cvttps2dq   xmm1,xmm2
    psubd       xmm1,xmm0
    movdqa      xmm0,xmm1
    cvtdq2ps    xmm1,xmm1
    subps       xmm2,xmm1
    movaps      xmm1,oword ptr Cheby10Exp0
    pslld       xmm0,23
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp1
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp2
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp3
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp4
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp5
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp6
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp7
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp8
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp9
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp10
    paddd       xmm0,xmm1
    ret


So, a app where we can input a math formula and result the full code with the polynomials already computed can be very handy
Title: Re: Atan2 SSE2
Post by: jj2007 on March 02, 2022, 08:29:15 AM
If you can bring (x^2/log(y)) + sin(x)*atan2(y) into the form y=f(x), I can help you (I'm not good at math)
Title: Re: Atan2 SSE2
Post by: guga on March 02, 2022, 09:09:07 AM
Let´s try it on a more simple function , then.

Try this:

sin(x)+cos(x)

which is the same as:

sqrt(2) *  sin(x + Pi/4)



On this example, LolRemez app returned:


G:\HD1_Fino\BackupHD IDE - ok\RosAsm\RosAsm\Development\Dlls\New\LolRemezOld>lolremez.exe sin(x)+cos(x)
// Approximation of f(x) = sin(x)+cos(x)
// on interval [ -1, 1 ]
// with a polynomial of degree 4.
// p(x)=(((3.9297499716796932e-2*x-1.5653249505225839e-1)*x-4.9891262416895649e-1)*x+9.9750048049840423e-1)*x+9.9991743032029927e-1
double f(double x)
{
    double u = 3.9297499716796932e-2;
    u = u * x + -1.5653249505225839e-1;
    u = u * x + -4.9891262416895649e-1;
    u = u * x + 9.9750048049840423e-1;
    return u * x + 9.9991743032029927e-1;
}

G:\HD1_Fino\BackupHD IDE - ok\RosAsm\RosAsm\Development\Dlls\New\LolRemezOld>pause
Press any key to continue . . .
Title: Re: Atan2 SSE2
Post by: jj2007 on March 02, 2022, 09:54:08 AM
Quote from: guga on March 02, 2022, 09:09:07 AM
Try this:

sin(x)+cos(x)

  FastMath SinPlusCos
For_ ct=-1 To 360 ; max 10,000 iterations
fild ct ; X
fld st
fstp REAL10 ptr [edi]
fmul FP8(0.01745329252) ; degrees->rad
fsincos
fadd
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
  FastMath


NameA equ SinPlusCos FastMath
TestA proc
  mov ebx, AlgoLoops-1 ; loop e.g. 100x
  xor ecx, ecx
  align 4
  .Repeat
void SinPlusCos(ecx) ; uses FastMath
fstp res8
inc ecx
  .Until ecx>=360
  ret
TestA endp
TestA_endp:

align_64
TestB_s:
NameB equ SinPlusCos Fpu
TestB proc
  mov ebx, AlgoLoops-1 ; loop e.g. 100x
  xor ecx, ecx
  align 4
  .Repeat
push ecx
fild stack
pop ecx
fmul FP8(0.01745329252) ; degrees->rad
fsincos ; uses the FPU
fadd
fstp res8
inc ecx
  .Until ecx>=360
  ret
TestB endp


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6745    cycles for 100 * SinPlusCos FastMath
39833   cycles for 100 * SinPlusCos Fpu

6797    cycles for 100 * SinPlusCos FastMath
39854   cycles for 100 * SinPlusCos Fpu

6741    cycles for 100 * SinPlusCos FastMath
39812   cycles for 100 * SinPlusCos Fpu

6745    cycles for 100 * SinPlusCos FastMath
39815   cycles for 100 * SinPlusCos Fpu

186     bytes for SinPlusCos FastMath
27      bytes for SinPlusCos Fpu

Real8   0.9823952887398167411   SinPlusCos FastMath
Real8   0.9823952887398167411   SinPlusCos Fpu
exact   0.9823952887191077263


Now the question is, of course, how fast and how accurate is LolRemez :rolleyes:
Title: Re: Atan2 SSE2
Post by: guga on March 02, 2022, 10:08:36 AM
Great work. But still uses a precalculate table, right ?

About Lolremez, i´m not sure how accurate it is. I didn´t tested the results on the example i gave. For what i saw so far, the accuracy depends on the amount of polynomials to be calculated. For example, using a 4th degree polyomial is less accurate than using a 9th degree. If we could build a routine that can be able to do the maths of those equations using the polynomials approximations (ChebyChev, LolRemez, MinMax etc etc), we can achieve a higher level of accuracy using more polynomials (and yet, this shouldn´t affect performance since we can do it in SSE2 using few instructions, without any loops).

This library seems to have a more accurate results, but didn´t tested yet.
https://www.boost.org/doc/libs/1_37_0/libs/math/doc/sf_and_dist/html/math_toolkit/using_udt/use_ntl.html

Or this one..seems to be even more accurated

https://www.boost.org/doc/libs/1_37_0/libs/math/doc/sf_and_dist/html/math_toolkit/backgrounders/lanczos.html
Title: Re: Atan2 SSE2
Post by: jj2007 on March 02, 2022, 10:19:45 AM
I thought you had the FreeBasic DLL for testing LolRemez... :cool:

It would be fun to add it to the SinPlusCos.zip testbed
Title: Re: Atan2 SSE2
Post by: guga on March 02, 2022, 10:21:27 AM
Quote from: jj2007 on March 02, 2022, 10:19:45 AM
I thought you had the FreeBasic DLL for testing LolRemez... :cool:

It would be fun to add it to the testbed

Yes, this is what i´m trying to do. Im quite finishing rebuilding Fb LolRemez to test. The main problem is that i got stuck today on a routine called "GaussJordan". I´m trying to fix it before go further on the porting.
Title: Re: Atan2 SSE2
Post by: jj2007 on March 02, 2022, 10:50:32 AM
I've tested it now, see above, SinPlusCosV2.zip

It's fast but not very accurate. I wonder how fast it is with the same accuracy as the fpu or FastMath (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1482). Also, -1 ... +1 is not the full 0...360 degrees range.
Title: Re: Atan2 SSE2
Post by: jack on March 02, 2022, 11:17:25 AM
guga, if you are interested I can post a rational minimax approximation of arctan in the interval -1 to 1 with max error 1.34e-16
Title: Re: Atan2 SSE2
Post by: guga on March 02, 2022, 11:44:19 AM
Quote from: jj2007 on March 02, 2022, 10:50:32 AM
I've tested it now, see above, SinPlusCosV2.zip

It's fast but not very accurate. I wonder how fast it is with the same accuracy as the fpu or FastMath (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1482). Also, -1 ... +1 is not the full 0...360 degrees range.

Hi JJ. I saw it. Indeed, it´s very fast, but lack precision (probably because it is using a 4th degree polynomial and not 9th degree). I´ll make further tests on it to se if Fb version is more accurate and how can we improve it more.

Title: Re: Atan2 SSE2
Post by: guga on March 02, 2022, 11:48:18 AM
Quote from: jack on March 02, 2022, 11:17:25 AM
guga, if you are interested I can post a rational minimax approximation of arctan in the interval -1 to 1 with max error 1.34e-16

Hi Jack, yes, please. :) Also, post how you achieved such result :) It would be nice trying o implement a app using better approximation technique.
Title: Re: Atan2 SSE2
Post by: guga on March 02, 2022, 12:45:42 PM
Hi JJ

try with these, please:

double f(double x)
{
    double u = 2.681518169196906e-6;
    u = u * x + 2.4121467561738585e-5;
    u = u * x + -1.9833112700127753e-4;
    u = u * x + -1.3882965671023472e-3;
    u = u * x + 8.3332934231267097e-3;
    u = u * x + 4.1666455708179865e-2;
    u = u * x + -1.6666665852866353e-1;
    u = u * x + -4.9999997368823986e-1;
    u = u * x + 9.9999999952226541e-1;
    return u * x + 9.999999994749521e-1;
}


I tried to insert it on masmbasic, but an error ocurred
Title: Re: Atan2 SSE2
Post by: jj2007 on March 02, 2022, 02:07:11 PM
Hi Gustavo,
Here are the results:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2241    cycles for 100 * SinPlusCos FastMath
12007   cycles for 100 * SinPlusCos Fpu
950     cycles for 100 * LolRemez 5
2329    cycles for 100 * LolRemez 8

2106    cycles for 100 * SinPlusCos FastMath
11833   cycles for 100 * SinPlusCos Fpu
950     cycles for 100 * LolRemez 5
2329    cycles for 100 * LolRemez 8

2108    cycles for 100 * SinPlusCos FastMath
11833   cycles for 100 * SinPlusCos Fpu
949     cycles for 100 * LolRemez 5
2330    cycles for 100 * LolRemez 8

2106    cycles for 100 * SinPlusCos FastMath
11832   cycles for 100 * SinPlusCos Fpu
950     cycles for 100 * LolRemez 5
2330    cycles for 100 * LolRemez 8

183     bytes for SinPlusCos FastMath
24      bytes for SinPlusCos Fpu
72      bytes for LolRemez 5
124     bytes for LolRemez 8

Real8   1.383309602960451023    SinPlusCos FastMath
Real8   1.383309602960451023    SinPlusCos Fpu
Real8   1.382865190370866637    LolRemez 5
Real8   0.3833096037593586303   LolRemez 8
exact   1.383309602960451112


There is one tiny bit wrong, it seems. With 8 coefficients, it becomes difficult to keep track... see TestD :sad:
LolRemez 8 is slightly slower than FastMath, and it is very limited in its range. You have chosen -1 ... +1, which translates to degrees would be -57...58 degrees. Can you give me the coefficients for -PI ... +PI? That is FastMath's range in the attached version 3 (but only -57...+58 are used, in order to yield the same result as LolRemez).

Re precision: FastMath yields 16 digits of precision, compared to the value that I get with the Windows Calculator.
Title: Re: Atan2 SSE2
Post by: guga on March 02, 2022, 04:50:01 PM
That´s weird.

I thought that giving a range of -1 to 1 mean s that we are dealing with -Pi to Pi covering the full range, but using normalized values. I´ll take a look later to see how exactly this lolremez works
https://github.com/samhocevar/lolremez

I found some other interesting articles about LolRemez tests here:
http://a-hackers-craic.blogspot.com/2014/07/a-better-faster-sincoscexpi.html
and
https://github.com/samhocevar/lolremez
https://asmbits.blogspot.com/2016/03/vector-sin-and-cos-sse2.html
Title: Re: Atan2 SSE2
Post by: jack on March 02, 2022, 09:44:48 PM
guga, sorry for being late but just in case you want to give these a try

(x *(8883.0849736588075911 +
     x^2 *(20222.7394253468435665 +
        x^2 *(16159.2278648857071736 +
           x^2 *(5387.4278783915353868 +
              x^2 *(686.94856409857219691 +
                 22.393048501019117472 *
x^2))))))/(8883.0849736588087835 +
   x^2 *(23183.7677498992921884 +
      x^2 *(22110.5334534862024463 +
         x^2 *(9389.8642844889571880 +
            x^2 *(1719.75402628941589287 +
               x^2 *(107.898060132482141870 + x^2))))))

I used Mathematica with the following steps
first load the package

<< FunctionApproximations`

then approximate the function

f = MiniMaxApproximation[ ArcTan[Sqrt[x]]/Sqrt[x], {x, {1/10^50, 1}, 5, 6},  WorkingPrecision -> 80, Brake -> {500, 5}]

and extract the polynomial from the resulting list

f = f[[2, 1]]

the reason for approximating Arctan(Sqrt(x))/Sqrt(x) is to eliminate the odd polynomial so instead of
x - x^3/3 + x^5/5 ... you have 1 - x/3 + x^2/5 - x^3/7 and so the approximating polynomial won't have even powers of x with small coefficients but now we need to convert the polynomial obtained back to odd and at the same time reduce the digits of the polynomial

f = N[(f /. x -> x*x)*x, 22]

and convert the polynomial to HornerForm

HornerForm[f]

Title: Re: Atan2 SSE2
Post by: jack on March 02, 2022, 11:11:26 PM
guga, you could reduce the argument as follows, if x=Pi*2 then arctan(x) = 2*arctan(x/(1+sqrt(1+x^2))) so instead of arctan(6.28...) you calculate 2 * arctan(.853...)
reducing the interval to less than 0..1 for the minimax approximation doesn't payoff
Title: Re: Atan2 SSE2
Post by: jack on March 03, 2022, 12:11:38 AM
if you reduce the argument x as follows ( x in the range -Pi*2 .. Pi*2 )

for i=1 to 5
    x=x/(1+sqrt(1+x*x))
next i

then you could calculate the arctan using this polynomial

y = 32*((x * (202.153477236938512481 +
   x^2 * (305.19168618485308633 +
      x^2 * (127.474551705012762179 +
         12.6747296167999257198 * x^2))))/(202.153477236938513333 +
x^2 * (372.576178597164761655 +
    x^2 * (211.235915790272755720 +
       x^2 * (37.4505339820668783381 + x^2)))))

or you could simply use the power series

y = 32*(x-x^3/3+x^5/5-x^7/7+x^9/9-x^11/11)

or in HornerForm

y = 32*(x*(1+x^2*(-(1/3)+x^2*(1/5+x^2*(-(1/7)+x^2*(1/9-x^2/11))))))

or

y = x *(32 + x^2*(-(32/3) + x^2*(32/5 + x^2*(-(32/7) + x^2*(32/9 - (32*x^2)/11)))))
Title: Re: Atan2 SSE2
Post by: jj2007 on March 03, 2022, 02:08:57 AM
Quote from: guga on March 02, 2022, 04:50:01 PMI thought that giving a range of -1 to 1 mean s that we are dealing with -Pi to Pi covering the full range, but using normalized values.

Nope. If they were normalised, I would not get the correct values...

New version for sin+cos, this is now the code to calculate y(x):

PushPoly ecx*0.01745329251994329577,\
3.9297499716796932e-2,\
-1.5653249505225839e-1,\
-4.9891262416895649e-1,\
9.9750048049840423e-1,\
9.9991743032029927e-1
fstp res8


The *0.017 translates degrees to rad. The two LolRemez functions operate over the limit range of -57...+58 degress. If you give me the coefficients for the full -180...+180 or 0...360 degrees range, I can adapt them accordingly.

Timings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2256    cycles for 100 * SinPlusCos FastMath
11833   cycles for 100 * SinPlusCos Fpu
1308    cycles for 100 * LolRemez 5
3258    cycles for 100 * LolRemez 8

2106    cycles for 100 * SinPlusCos FastMath
12328   cycles for 100 * SinPlusCos Fpu
1309    cycles for 100 * LolRemez 5
3177    cycles for 100 * LolRemez 8

183     bytes for SinPlusCos FastMath
24      bytes for SinPlusCos Fpu
72      bytes for LolRemez 5
122     bytes for LolRemez 8

Real8   1.383309602960451023    SinPlusCos FastMath
Real8   1.383309602960451023    SinPlusCos Fpu
Real8   1.382865190370866637    LolRemez 5
Real8   1.383309603234310847    LolRemez 8
exact   1.383309602960451112
Title: Re: Atan2 SSE2
Post by: guga on March 03, 2022, 05:27:22 AM
Hi Guys

JJ, i don´t know yet how exactly the ranges works on the original lolremez version.

You can give a shot with the attached file or on it´s documentation here: https://github.com/samhocevar/lolremez

About Fb itself, i´m finishing to port, but i must confess, this is one of the most painful things i ever saw before. FB inserts tons of useless code during compiling and the code is so bad organized that it´s really really hard to port or even understand the code flow.
For example, to put a simple line on the console, FB code, seems to be easy, fast and reliable, right ? Something at principle, simple as:
   ' puts (String(Loword(Width)*Hiword(Width)," ")+Chr(10)) 'clear console
    puts("_____________________________")
    puts(!"Approxomating Polynomial\n")
     sh=copy + "  [" +str(lower) + " To " + str(upper) + "]"


Ok...but, that´s not true at all. Internally for inserting a simple line, FB does this:


    ; puts (String(Loword(Width)*Hiword(Width)," ")+Chr(10)) 'clear console
    C_call 'msvcrt.puts' {B$ "_____________________________", 0}
    C_call 'msvcrt.puts' {B$ "Approximating Polynomial", 0A, 0}
    call 'FbRtl32.fb_StrAssign' HValue, 0-1, DValue, 0-1, 0
    call 'FbRtl32.fb_StrConcatAssign' HValue, 0-1, {B$ "  [", 0}, 4, 0
    call 'FbRtl32.fb_DoubleToStr' D$Lower, D$Lower+4
    call 'FbRtl32.fb_StrConcatAssign' HValue, 0-1, eax, 0-1, 0
    call 'FbRtl32.fb_StrConcatAssign' HValue, 0-1, {B$ " To ", 0}, 5, 0
    call 'FbRtl32.fb_DoubleToStr' D$Upper, D$Upper+4
    call 'FbRtl32.fb_StrConcatAssign' HValue, 0-1, eax, 0-1, 0
    call 'FbRtl32.fb_StrConcatAssign' HValue, 0-1, {B$ "]", 0}, 2, 0
    C_call 'msvcrt.puts' D$HValue.pStringData


When, in fact, all of the above code, each function goes on several windows api (msvcrt), instead simply doing it in 2 lines of code (In fact, using only 2 of these apis: sprintf and puts) like this:

[Sz_PolyminalIntro: B$ "_____________________________
Approximating Polynomial

%s  [%.16g To %.16g]", 0]

[Sz_OutputStrBuff: B$ 0 #512]

    C_call 'msvcrt.sprintf' Sz_OutputStrBuff, Sz_PolyminalIntro, D$DValue.pStringData, D$Lower, D$Lower+4, D$Upper, D$Upper+4
    C_call 'msvcrt.puts' Sz_OutputStrBuff


Again, the whole idea is not bad, but the internal organization is terrible. On the other hand, BCX for example was used to produce a way cleaner code. (Don´t know how is it doing right now, because didn´t had time o test, but as far i remember, BCX was way more robust then FB - at least in terms of internal organization)

Anyway..i´ll continue to port this, and will try to finish today or tomorrow. Keep in mind, that, if FB did that mess on a single string outputted to a console, you may get an idea of what a hell i´m trying to fix on the rest of the code :biggrin: :biggrin: :biggrin: :biggrin:
Title: Re: Atan2 SSE2
Post by: jj2007 on March 03, 2022, 08:31:19 AM
Quote from: guga on March 03, 2022, 05:27:22 AMJJ, i don´t know yet how exactly the ranges works on the original lolremez version.
It's the -r parameter:
lolremez --double -d 9 -r "-pi:pi" "sin(x)+cos(x)"
pause

QuoteYou can give a shot with the attached file

Here are the timings: LolRemez is faster if you don't need precision. Note these are results over the full -180...+180 degrees range:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6476    cycles for 100 * SinPlusCos FastMath
39649   cycles for 100 * SinPlusCos Fpu
5289    cycles for 100 * LolRemez 6
6341    cycles for 100 * LolRemez 7
7350    cycles for 100 * LolRemez 8
10189   cycles for 100 * LolRemez 10

6480    cycles for 100 * SinPlusCos FastMath
39487   cycles for 100 * SinPlusCos Fpu
5296    cycles for 100 * LolRemez 6
6334    cycles for 100 * LolRemez 7
7378    cycles for 100 * LolRemez 8
10195   cycles for 100 * LolRemez 10

75      bytes for SinPlusCos FastMath
33      bytes for SinPlusCos Fpu
83      bytes for LolRemez 6
93      bytes for LolRemez 7
103     bytes for LolRemez 8
123     bytes for LolRemez 10

exact   -0.9823952887191077263
Real8   -0.9823952887191077510  SinPlusCos FastMath
Real8   -0.9823952887191077510  SinPlusCos Fpu
Real8   -0.9580416352510443546  LolRemez 6
Real8   -0.9768328662750306313  LolRemez 7
Real8   -0.9833290686311381146  LolRemez 8
Real8   -0.9823756874760426472  LolRemez 10
Title: Re: Atan2 SSE2
Post by: guga on March 03, 2022, 01:11:56 PM
Hi JJ. So, on this example it´s not totally usefull. I´m finishing the app to we test later, but ...in meanwhile i found some interesting material that can improve accuracy using something called Caratheodory-Fejer approximation. This method seems to use the Chebyshev coefficients to decrease the error.

it´s way out of my head right now. But, these articles are the kind of material that Siekmanski and Jack would like  :azn: :azn: :azn:

https://encyclopediaofmath.org/wiki/Carath%C3%A9odory-Fej%C3%A9r_problem
https://www.embeddedrelated.com/showarticle/152.php
https://www.chebfun.org/docs/guide/guide04.html
https://www.mathworks.com/matlabcentral/fileexchange/22055-caratheodory-fejer-approximation
https://books.google.com.br/books?id=xsnBDwAAQBAJ&pg=PA159&lpg=PA159&dq=Caratheodory-Fejer+approximation,&source=bl&ots=uIKjZvoFNT&sig=ACfU3U1fb_hKZ9FRvyqG7b5y5KO4pCWsnA&hl=pt-BR&sa=X&ved=2ahUKEwiRn8_X7Kj2AhXJr5UCHQ8tDVo4HhDoAXoECA8QAw#v=onepage&q=Caratheodory-Fejer%20approximation%2C&f=false
Title: Re: Atan2 SSE2
Post by: jj2007 on March 03, 2022, 08:04:47 PM
Quote from: guga on March 03, 2022, 01:11:56 PM
Hi JJ. So, on this example it´s not totally usefull.

FastMath (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1482) is a factor 6 faster than the FPU, and on par with a LolRemez with 7 coefficients. The latter deviates by half a percent from the true result; that is 9 pixels on a screen 1600 pixels wide... so it depends on your usage. IIRC you need it for colours, so that is 256*0.5%->+-0.7 for the LolRemez 7. I suppose you can live with that :smiley:

6476    cycles for 100 * SinPlusCos FastMath
39649   cycles for 100 * SinPlusCos Fpu
5289    cycles for 100 * LolRemez 6
6341    cycles for 100 * LolRemez 7
...
exact   -0.9823952887191077263
Real8   -0.9823952887191077510  SinPlusCos FastMath
Real8   -0.9823952887191077510  SinPlusCos Fpu
Real8   -0.9580416352510443546  LolRemez 6
Real8   -0.9768328662750306313  LolRemez 7
Title: Re: Atan2 SSE2
Post by: guga on March 04, 2022, 05:57:15 AM
Hi JJ

Agree when you say that for colors, it may not be that precise, but for other needs it maybe necessary something that grants more precision. I´m quite finishing the Fb version and will release it soon to we test both the Dodicat´s approach and the FB runtime dlls we created  :thumbsup: :thumbsup: :thumbsup:

This is the app fully working on the same way as the original. I´ll clean up the code and will go to the fix of the export (savings) now.

One question, i plan to export the result also in Masm and RosAsm syntax, so we can use this thing if needed, but i don´t know how to create multiline comments in masm.

In Rosasm when i want a multiline comment, insert it all in between double semicolon in between the text. Like:
;;
     This is a comment
;;


How i do it in masm ? What is the syntax that allow me to do multi-line comments in masm ?


(https://i.ibb.co/vdnt4q5/vszx-Untitled.jpg) (https://ibb.co/HP8JT4v)
Title: Re: Atan2 SSE2
Post by: HSE on March 04, 2022, 06:09:56 AM
Standard is:

comment *
     This is a comment
*

Can be whatever character not used in commented lines.

To comment code:if 0
commented code
endif

because you can uncomment:if 1
commented code
endif



Title: Re: Atan2 SSE2
Post by: jj2007 on March 04, 2022, 07:58:12 AM
Quote from: HSE on March 04, 2022, 06:09:56 AMTo comment code:if 0
  commented code
endif

That's what I use (169 occurrences in the RichMasm source...)

For comparing several variants, I often use
if 0
   ... code version A ...
elseif 1
   ... code version B ... (will be used in this example)
else
   ... code version C ...
endif
Title: Re: Atan2 SSE2
Post by: FORTRANS on March 04, 2022, 08:19:22 AM
Hi,

   There is the "COMMENT" directive.

        COMMENT !

   That works as [COMMENT Symbol] and when the symbol
next appears, it ends the comments section.  (With the
following line as code, I think.)

                !
; This is now normal code.
        MOV     EAX,-1


Cheers,

Steve N.
Title: Re: Atan2 SSE2
Post by: daydreamer on March 04, 2022, 07:02:13 PM
Quote from: FORTRANS on March 04, 2022, 08:19:22 AM
Hi,

   There is the "COMMENT" directive.

        COMMENT !

   That works as [COMMENT Symbol] and when the symbol
next appears, it ends the comments section.  (With the
following line as code, I think.)

                !
; This is now normal code.
        MOV     EAX,-1


Cheers,

Steve N.
thanks Steve :thumbsup:
makes it easier to translate C style 
/*
multiline comment
*/
Title: Re: Atan2 SSE2
Post by: InfiniteLoop on April 21, 2022, 11:47:56 PM
How are you calculating the "lolremez" AKA chebyshev coefficients? Are you using the formula or solving the simultaneous equations? I got stuck and gave up ages ago.
Title: Re: Atan2 SSE2
Post by: jj2007 on April 22, 2022, 01:14:27 AM
See Reply #61