## Atan2 SSE2

Started by guga, February 07, 2022, 08:27:40 AM

#### guga

Hi Guys

Has anyone succeeded in creating an atan2 function similar to (or even better than) the ones Intel provides?

https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-short-vector-math-library-ops/intrinsics-for-trigonometric-operations/mm-atan2-pd-mm256-atan2-pd.html
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### jj2007

Have you seen How to Find a Fast Floating-Point atan2 Approximation?

If your problem is to find atan2(X,Y), and either X or Y is a constant, then FastMath would be an option.

Here is one that I don't understand

#### guga

Hi JJ. I'll give it a try.

Also, ucrtbase.dll has an atan2 function using SSE2 (__libm_sse2_atan2), but I haven't tested it for speed yet. I wonder if there is a faster way to do it. For the normal way using the FPU, I wrote a routine years ago that can also convert to hue angles. The problem is that it needs to be fast, and I presume the FPU way is not fast enough, especially since I plan to use it with Marinus FFT routines for audio.

```
[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]

Proc atan2:
    Arguments @pY, @pX, @ConvDegree
    Structure @TempStorage 16, @TmpDataDis 0
    Uses eax, ebx, ecx

    mov ebx D@pY
    mov ecx D@pX
    fld R$ebx
    fld R$ecx
    fpatan
    fstsw ax
    wait
    shr ax 1
    jnb L2>
        fclex | stc | xor eax eax | ExitP
L2:
    .If D@ConvDegree = &TRUE
        fmul R$Float_AtanPiFactor | fst R@TmpDataDis
        Fpu_If R@TmpDataDis < R$FloatZero
            fadd R$Float360
        Fpu_Else_If R@TmpDataDis >= R$Float360
            fsub R$Float360
        Fpu_End_If
    .End_If
    clc
    mov eax &TRUE
EndP
```

#### guga

Found something interesting using polynomials

https://mazzo.li/posts/vectorized-atan2.html
https://pub.dev/documentation/complex/latest/fastmath/atan.html
https://www.dsprelated.com/showarticle/1052.php
https://opensource.apple.com/source/Libm/Libm-315/Source/Intel/atan.c
https://stackoverflow.com/questions/11930594/calculate-atan2-without-std-functions-or-c99
https://www.examplefiles.net/cs/453132

#### guga

Hi guys

Can someone test the timings for this, please?

I managed to rewrite the SSE2 atan2 function from ucrtbase.dll and want to know whether it is really as fast as it seems (also compared to the original version). The function works the same as __libm_sse2_atan2 from ucrtbase.dll. I optimized it only a little, by reorganizing the function. (The original version is kind of a mess due to heavy spaghetti code.)

I'm pretty sure it can be optimized further for speed, but I don't fully understand yet how it works.

I built it into a DLL just to make the benchmark tests easier. The function is Sse2_atan2 in test.dll. It doesn't take any parameters.

To calculate the atan2, you need to place the values of x, y (as doubles) in the registers XMM0 and XMM1 (as in the original version).

In RosAsm syntax the function can be called as:

```
[MyValue1: R$ 0.554545]
[MyValue2: R$ 0.1]

movupd XMM0 X$MyValue1
movupd XMM1 X$MyValue2

call Sse2_atan2
```

The original version (and also mine) uses a table of arctangents to calculate the atan2 of a value. The table consists of 164 pairs of Real8 values. In each pair, the first Real8 is the arctangent of a certain value, and the second seems to be an error/difference term of some sort.

For my tests, I excluded this error term and used only the directly calculated arctangent values.
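
One possible reading of the second Real8 in each pair, offered as an assumption and not confirmed by the ucrtbase source: it is the low-order remainder of a high/low split, so that first + second approximates the arctangent more precisely than a single double can. The second values listed in this thread all stay below 2^-50 (about 8.88e-16), which would fit a first value rounded down to a multiple of 2^-50. A minimal Python sketch that checks this reading against the Pi/2 and Pi/4 entries (the name `split_hi_lo` and the 50-bit grid are hypothetical, chosen only for illustration):

```python
from decimal import Decimal, getcontext
import math

getcontext().prec = 60  # enough digits to see far below double precision

# Same digits as Float_PI in this thread.
PI = Decimal("3.141592653589793238462643383279502884197169399375105820974944592307")


def split_hi_lo(value, grid_bits=50):
    """Split a high-precision value into (hi, lo).

    hi is value rounded down to a multiple of 2**-grid_bits (assumed grid),
    lo is the remainder value - hi. Both are returned as Python floats.
    """
    scale = Decimal(2) ** grid_bits
    hi = Decimal(math.floor(value * scale)) / scale
    lo = value - hi
    return float(hi), float(lo)


if __name__ == "__main__":
    # Compare with the table entries for atan(1) = Pi/4 and the final Pi/2 entry.
    print(split_hi_lo(PI / 4))
    print(split_hi_lo(PI / 2))
```

If this reading is right, filling the second Real8 with 0 simply drops the low-order correction, which would cost only the last few bits of the result.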

This table seems to be generated in a certain order. Its 164 values run from 0.02928849741073 up to Pi/2: Pos 1 holds atan(0.029296875), Pos 163 holds atan(32), and the last value (Pos 164) is not the atan of a table argument; it is fixed at Pi/2. Going from the last entry towards the first (excluding Pos 164), the step between the arguments is halved after every 16 values, starting with a step of 1.

So, Pos 163 = atan(32) ---> decreasing by 1 on each step until Pos 147
Pos 147 = atan(16) ---> the step is halved, so decreasing by 0.5 on each step until Pos 131
Pos 131 = atan(8) ---> the step is halved again, so decreasing by 0.25 on each step until Pos 115
etc.
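
The layout described above can be sketched in Python. This is an illustration inferred from the listed values, not the generator used by ucrtbase: it assumes the arguments are (1 + k/16) * 2^e for e = -5..4 and k = 0..15, preceded by 30/1024 and 31/1024 and followed by 32, with Pi/2 appended as the final entry. The function name `atan_table_arguments` is hypothetical.

```python
import math


def atan_table_arguments():
    """Arguments x whose atan(x) fill Pos 1..163, in ascending order (assumed layout)."""
    xs = [30 / 1024, 31 / 1024]            # the two entries just below 1/32
    for e in range(-5, 5):                 # octaves [2^e, 2^(e+1)) for e = -5 .. 4
        for k in range(16):                # 16 equal steps per octave
            xs.append((1 + k / 16) * 2.0 ** e)
    xs.append(32.0)                        # Pos 163
    return xs


# Pos 1..163 are arctangents, Pos 164 is Pi/2 itself.
table = [math.atan(x) for x in atan_table_arguments()] + [math.pi / 2]

if __name__ == "__main__":
    print(len(table))
    # Positions in the post are 1-based, Python indices are 0-based.
    for pos in (1, 115, 131, 147, 163, 164):
        print(pos, table[pos - 1])
```

The printed values can be compared with the first Real8 of the corresponding pairs in the table posted in this thread.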

I built a function that reproduces the values of the original table (except the 2nd Real8 of each pair, the error term, which I filled with 0 for now):

```
[MyTangTBl: R$ 0 #(164*2)]
[Float_PI: R$ 3.141592653589793238462643383279502884197169399375105820974944592307]

Proc CreateTangetTable:
    Local @Counter, @Value, @InternalCounter, @DecreaseRate
    Uses esi

    mov D@Counter 164
    mov D@InternalCounter 0
    fld1 | fstp F@DecreaseRate
    mov esi MyTangTBl
    add esi (164*2*8) | sub esi (8*2) ; point to the last pair (Pos 164)
    fld R$Float_PI | fmul F$Float_half | fstp R$esi ; last entry is Pi/2 (the posted code stored Pi here)
    dec D@Counter
    sub esi (8*2)
    mov D@Value 33 | fild D@Value | fstp F@Value
    .Do
        ; Value = Value - DecreaseRate, then store atan(Value/1)
        fld D@Value | fsub F@DecreaseRate | fst F@Value | fld1 | fpatan | fstp R$esi
        If D@InternalCounter = 16
            ; halve the step after every 16 entries
            fld F@DecreaseRate | fmul F$Float_half | fstp F@DecreaseRate
            mov D@InternalCounter 0
        End_If
        inc D@InternalCounter
        sub esi (8*2)
        dec D@Counter
    .Loop_Until D@Counter = 0
L1:
EndP
```

The original values of this table (including the error values; I didn't include them in my version, btw, but that doesn't affect the precision too much) are:
```
DoubleData <0.02928849741073059, 3.878020342543118e-16>
DoubleData <0.03026419423862503, 2.13439966346682e-16>
DoubleData <0.03123983343026815, 1.27181094510696e-16>
DoubleData <0.03319093149711083, 7.623353902758895e-16>
DoubleData <0.03514177680279662, 1.622534112980482e-16>
DoubleData <0.03709235455039117, 6.433766265361664e-16>
DoubleData <0.03904264995516638, 6.181886837851646e-16>
DoubleData <0.04099264824526294, 8.420830387839661e-16>
DoubleData <0.04294233466236186, 3.142811786546985e-16>
DoubleData <0.04489169446234609, 4.096264921495231e-16>
DoubleData <0.04684071291596936, 2.912679762198781e-16>
DoubleData <0.04878937530951521, 4.053693341011583e-16>
DoubleData <0.05073766694545956, 6.612695435815108e-16>
DoubleData <0.05268557314312972, 3.289966758135281e-16>
DoubleData <0.05463307923935901, 4.691749851066725e-16>
DoubleData <0.05658017058914488, 8.289832696419427e-16>
DoubleData <0.05852683256630176, 1.362951662640667e-17>
DoubleData <0.06047305056410668, 6.37811249269131e-16>
DoubleData <0.06241880999595661, 7.339736781833367e-16>
DoubleData <0.06630889491982295, 5.407451321058335e-16>
DoubleData <0.07019697107187017, 3.451465030350394e-16>
DoubleData <0.07408292254903337, 3.609579312927062e-16>
DoubleData <0.0779666338315419, 4.082603982997626e-16>
DoubleData <0.08184798980307573, 8.205279419748868e-16>
DoubleData <0.08572687577074412, 6.992365845342258e-16>
DoubleData <0.08960317748487157, 1.793304156180366e-16>
DoubleData <0.09347678115858926, 2.018823445176747e-16>
DoubleData <0.09734757348722312, 5.576265782080219e-16>
DoubleData <0.101215441667466, 6.856928051415318e-16>
DoubleData <0.1050802734163288, 7.80192436556187e-16>
DoubleData <0.1089419569898658, 4.846007563068433e-17>
DoubleData <0.1128003812016587, 6.957566319383172e-16>
DoubleData <0.1166554354410687, 6.438661649715736e-16>
DoubleData <0.1205070096912237, 8.182569515647907e-16>
DoubleData <0.1243549945467608, 6.352529150170111e-16>
DoubleData <0.1320397616146387, 4.274189722415823e-17>
DoubleData <0.1397088742891635, 1.913310428846708e-16>
DoubleData <0.1473614810886508, 8.658324432321626e-16>
DoubleData <0.1549967419239406, 3.426523229816613e-16>
DoubleData <0.1626138285979479, 6.461627098025713e-16>
DoubleData <0.1702119252854741, 2.74014592076487e-16>
DoubleData <0.1777902289926754, 6.898598082898684e-16>
DoubleData <0.1853479499956947, 5.969184350010091e-17>
DoubleData <0.1928843122579744, 2.42385590364413e-16>
DoubleData <0.2003985538258783, 1.696734079809579e-16>
DoubleData <0.2078899272022623, 7.012225510572437e-16>
DoubleData <0.215357699697738, 5.59849672442657e-17>
DoubleData <0.2228011537593941, 3.830792364463579e-16>
DoubleData <0.2302195872768431, 6.229360680729788e-16>
DoubleData <0.2376123138654709, 3.71404797316887e-16>
DoubleData <0.2449786631268633, 8.156104484719729e-16>
DoubleData <0.2596296294082574, 1.30261057387131e-16>
DoubleData <0.2741674511196583, 4.523505634252264e-16>
DoubleData <0.2885873618940771, 3.187832078137744e-16>
DoubleData <0.3028848683749708, 6.551229868720925e-16>
DoubleData <0.3170557532091465, 5.361722230696518e-16>
DoubleData <0.3310960767041315, 5.471589019367845e-16>
DoubleData <0.345002177207105, 8.808349770693735e-17>
DoubleData <0.3587706702705722, 3.088733564861919e-17>
DoubleData <0.3723984466767538, 4.081903701236505e-16>
DoubleData <0.3858826693980735, 3.013439834812086e-16>
DoubleData <0.3992207695752521, 4.665551909062331e-16>
DoubleData <0.4124104415973866, 7.057684437286449e-16>
DoubleData <0.4254496373700416, 6.894493455169868e-16>
DoubleData <0.4383365598579569, 8.632356493938598e-16>
DoubleData <0.4510696559885234, 3.280735600183736e-17>
DoubleData <0.4636476090008053, 8.553660459218291e-16>
DoubleData <0.4883339510564051, 4.32715973660733e-16>
DoubleData <0.5123894603107368, 8.627156382272694e-16>
DoubleData <0.5358112379604636, 1.069585067790331e-16>
DoubleData <0.558599315343562, 4.38633579301471e-16>
DoubleData <0.5807563535676703, 9.660765868058498e-17>
DoubleData <0.6022873461349638, 3.62571214759831e-16>
DoubleData <0.6231993299340655, 4.708132487014636e-16>
DoubleData <0.6435011087932843, 1.268570875139599e-16>
DoubleData <0.6632029927060925, 7.463955685933131e-16>
DoubleData <0.6823165548747481, 6.943223671560008e-18>
DoubleData <0.7008544078844494, 7.572798548942514e-16>
DoubleData <0.7188299996216241, 4.226108214056057e-16>
DoubleData <0.7362574289814274, 7.008731912580885e-16>
DoubleData <0.7531512809621939, 5.308545776533963e-16>
DoubleData <0.7695264804056574, 8.51128500644098e-16>
DoubleData <0.7853981633974483, 3.061616997868383e-17>
DoubleData <0.8156919233162228, 6.554192491473065e-16>
DoubleData <0.8441539861131702, 8.397650495807761e-16>
DoubleData <0.8709034570756522, 7.544578813301367e-16>
DoubleData <0.8960553845713433, 6.95372577632837e-16>
DoubleData <0.9197196053504166, 1.814702107965036e-16>
DoubleData <0.9420000403794635, 1.656306773209825e-16>
DoubleData <0.9629943306809361, 1.070356418673049e-16>
DoubleData <0.9827937232473287, 3.46970218418778e-16>
DoubleData <1.001483135694234, 7.605168950105478e-16>
DoubleData <1.019141344266349, 6.761378336444607e-16>
DoubleData <1.0358412530088, 3.194313981784504e-17>
DoubleData <1.051650212548373, 5.696281674604188e-16>
DoubleData <1.066630365315743, 6.065679184034902e-16>
DoubleData <1.080839000541168, 2.063682824136722e-16>
DoubleData <1.09432890732119, 2.165539287700089e-16>
DoubleData <1.10714871779409, 9.404471373566379e-17>
DoubleData <1.13095374397916, 5.153275478954471e-16>
DoubleData <1.152571997215667, 1.304472198360275e-16>
DoubleData <1.172273881128476, 7.499857009153807e-16>
DoubleData <1.190289949682532, 7.683333629842069e-17>
DoubleData <1.206817370285252, 2.637692813136457e-16>
DoubleData <1.222025323210989, 6.363421861261654e-16>
DoubleData <1.236059489478081, 7.449313421696882e-16>
DoubleData <1.249045772398254, 8.859822159005129e-16>
DoubleData <1.26109338225244, 2.544660011403809e-16>
DoubleData <1.272297395208717, 2.245875015034507e-17>
DoubleData <1.28274087974427, 8.788952309458591e-16>
DoubleData <1.292496667789785, 1.537365572357647e-16>
DoubleData <1.301628834009196, 2.09675419926785e-16>
DoubleData <1.310193935047555, 7.535879521228967e-16>
DoubleData <1.318242051016837, 3.808952695386159e-16>
DoubleData <1.325817663668032, 1.3380031118552e-16>
DoubleData <1.339705659598999, 6.401436961720526e-16>
DoubleData <1.352127380920955, 2.147674250751151e-17>
DoubleData <1.363300100359694, 3.313692220777249e-16>
DoubleData <1.373400766945015, 6.330567112173988e-16>
DoubleData <1.382574821490126, 1.86429700538549e-16>
DoubleData <1.390942827002418, 7.897412983652368e-16>
DoubleData <1.398605512271957, 4.208485980241463e-16>
DoubleData <1.40564764938027, 1.328183035426864e-16>
DoubleData <1.412141064608495, 5.703957436695217e-16>
DoubleData <1.418146998399631, 8.055395818750151e-16>
DoubleData <1.423717971406494, 3.09263314147271e-16>
DoubleData <1.428899272190733, 1.574732574926438e-16>
DoubleData <1.433730152484708, 8.442163750324489e-16>
DoubleData <1.438244794498222, 6.412036156724483e-16>
DoubleData <1.442473099109101, 7.776664761570937e-16>
DoubleData <1.446441332248135, 3.141578446404818e-16>
DoubleData <1.453687582228032, 8.199907271986118e-16>
DoubleData <1.460139105621001, 2.846543520033397e-16>
DoubleData <1.465919388064663, 1.24932049384283e-16>
DoubleData <1.471127674303734, 7.79686192079511e-16>
DoubleData <1.47584462045214, 4.779648065777258e-16>
DoubleData <1.480136439594151, 3.07070859685828e-16>
DoubleData <1.484057988118911, 4.096346991714267e-16>
DoubleData <1.487655094906455, 7.844371739461077e-17>
DoubleData <1.490966341082659, 6.221434760020122e-17>
DoubleData <1.494024435525118, 8.134142446723606e-16>
DoubleData <1.496857289136956, 6.830938909876451e-16>
DoubleData <1.499488862009605, 8.873091857396741e-16>
DoubleData <1.501939837493851, 7.267528106158521e-16>
DoubleData <1.504228163019072, 4.532269913842895e-16>
DoubleData <1.506369487369343, 5.605999046418969e-16>
DoubleData <1.508377516798939, 2.154370814741562e-16>
DoubleData <1.512040504079174, 1.402618800554007e-16>
DoubleData <1.515297821549179, 5.368157936821226e-16>
DoubleData <1.518213265183954, 7.375391359310956e-16>
DoubleData <1.520837931072953, 6.825613612540716e-16>
DoubleData <1.523213223517913, 4.501543596256141e-16>
DoubleData <1.52537304737332, 2.482983385700394e-17>
DoubleData <1.527345431403365, 7.934779986221159e-16>
DoubleData <1.529153747696308, 6.760940733854243e-16>
DoubleData <1.530817639671606, 3.549557335150754e-16>
DoubleData <1.532353736773708, 2.956322283362237e-16>
DoubleData <1.533776210920966, 9.377354806572844e-17>
DoubleData <1.535097214115572, 7.767503702335957e-16>
DoubleData <1.536327225795388, 8.864446875291863e-16>
DoubleData <1.537475330916649, 8.116858600761242e-17>
DoubleData <1.538549444359642, 5.614707476320801e-16>
DoubleData <1.539556493364628, 8.22229665146797e-16>
DoubleData <1.570796326794897, 6.123233995736766e-17>
```

The source is embedded in the DLL, but I can post it here later if the tests are OK. It shouldn't be hard to port to MASM, if someone is interested in working further on the function.

Like the original version, it can only compute one value (a double) at a time. But I presume it could be optimized to handle 4 floats at once to gain even more speed.

#### guga

Btw, to convert it to degrees you can also use this:

```
[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]
[Float_Zero: R$ 0]

Proc SSE_Atan2Double:
    Arguments @pY, @pX, @ConvDegree
    Uses eax, ebx, ecx

    mov ebx D@pY
    mov ecx D@pX
    movupd XMM0 X$ebx
    movupd XMM1 X$ecx
    ;call 'ucrtbase.__libm_sse2_atan2' ; <--- original version inside ucrtbase.dll
    call Sse2_atan2 ; <--- my test version
    .If D@ConvDegree = &TRUE ; <--- Flag to convert to degrees (True/False)
        movupd XMM1 X$Float_AtanPiFactor
        mulpd XMM0 XMM1
        xorpd xmm2 xmm2
        SSE_D_If xmm0 < xmm2 ; A simple macro to handle SSE2 comparisons. It is simply: COMISD XMM0 XMM2 | JNB D0> So, it is comisd followed by the corresponding jmp to D0 (our Else_If macro)
            addpd xmm0 X$Float360
        SSE_D_Else_If xmm0 > X$Float360 ; same as above: JMP D1> | D0:COMISD XMM0 X$FLOAT360 | JNA D0>
            subpd xmm0 X$Float360
        SSE_D_End_If ; The ending of the jumps D0: D1:
    .End_If
    mov eax &TRUE
EndP
```

Example of usage:

```
[TestingAtan2Data1: R$ 0.5317094316614787480759158718400589803054643151426570508592559650]
[TestingAtan2Data2: R$ 1]

call SSE_Atan2Double TestingAtan2Data1, TestingAtan2Data2, &TRUE
```

It will return 28º. (With the SSE2 code above, the result is left in XMM0; the FPU version posted earlier returns it in ST0.)
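
As a cross-check of the degree conversion, here is a minimal Python sketch of the same logic: multiply by 180/Pi, then wrap negative angles into the 0..360 range. It uses Python's `math.atan2` as the reference and is not a port of the routine; the name `atan2_degrees` is hypothetical.

```python
import math


def atan2_degrees(y, x):
    """atan2 in degrees, wrapped into [0, 360) like the ConvDegree branch."""
    deg = math.atan2(y, x) * (180.0 / math.pi)
    if deg < 0.0:
        deg += 360.0      # negative angles are shifted up by a full turn
    elif deg >= 360.0:
        deg -= 360.0      # kept for symmetry with the assembly version
    return deg


if __name__ == "__main__":
    # Same inputs as the usage example: y = tan(28 degrees), x = 1.
    print(atan2_degrees(0.5317094316614787, 1.0))
```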

#### jj2007

Quote from: guga on February 08, 2022, 11:10:30 PM
Can someone test the timings for this, please?

It's a DLL, Gustavo... how are we supposed to test it?

#### guga

The same way as you did for FastMath. Didn't you use an external DLL to call the function? The benchmark apps can handle external DLLs, right?

something like

importlib test.lib

invoke Sse2_atan2

But I can try to disassemble it and create a MASM-compatible source, if needed.

#### guga

Please see if this helps (it's the disassembled source).

#### jj2007

Quote from: guga on February 09, 2022, 05:13:51 AM
The same way as you did for FastMath. Didn't you use an external DLL to call the function? The benchmark apps can handle external DLLs, right?

something like

importlib test.lib

invoke Sse2_atan2

It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
```
  Dll "GugaAtan2"
  Declare Sse2_atan2, 2
```

fails miserably saying "can't find module Sse2_atan2"

#### HSE

Working here! (using source code)
No timing yet

erased a minor detail
Equations in Assembly: SmplMath

#### guga

Quote from: jj2007 on February 09, 2022, 06:04:39 AM
It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
```
  Dll "GugaAtan2"
  Declare Sse2_atan2, 2
```

fails miserably saying "can't find module Sse2_atan2"

That's weird. I named the exported function "Sse2_atan2". Are you sure you are importing test.dll? Because the name of the DLL is not "GugaAtan2".

Shouldn't it be:

Dll "test"
Declare Sse2_atan2, 1

The ordinal value of the function inside the DLL is 1, not 2.

#### guga

Quote from: HSE on February 09, 2022, 07:23:02 AM
Working here!
No timing yet

erased a minor detail

Tks HSE

#### guga

Quote from: jj2007 on February 09, 2022, 06:04:39 AM
It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
```
  Dll "GugaAtan2"
  Declare Sse2_atan2, 2
```

fails miserably saying "can't find module Sse2_atan2"

Sorry, I forgot to answer this:

"Sse2_atan2(double X, double Y)?"

Yes, the contents of xmm0 and xmm1 are doubles (the same as in the original library, ucrtbase.dll). The function does not take any parameters, but you must load the xmm0 and xmm1 registers with double values.

Internally, I rewrote the function like this:

```
; The table was generated as CreateTangetTable
; I rewrote/reorganized the whole function to avoid heavy usage of spaghetti code as in the original version inside ucrtbase.dll

Proc Sse2_atan2::
    movlpd X$SSEAtanTmpVal1 xmm0 ; <-----Temporary global variable to store the data to be used in the internal functions
    movlpd X$SSEAtanTmpVal2 xmm1 ; <-----Temporary global variable to store the data to be used in the internal functions
    pextrw eax xmm0 3 | and eax 07FF0 | sub eax 03870
    pextrw ecx xmm1 3 | and ecx 07FF0 | sub ecx 03870
    .If_And eax <= 0F00, ecx <= 0F00
        call Sse2_atan2Internal4
    .Else
        pextrw ecx xmm0 3 | and ecx 07FF0
        pextrw eax xmm1 3 | and eax 07FF0
        If ecx = 07FF0
            call Sse2_atan2Internal1
        Else_If eax = 07FF0
            call Sse2_atan2Internal2
        Else
            call Sse2_atan2Internal5
        End_If
    .End_If
EndP
```
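
To make the dispatch easier to follow, here is a minimal Python sketch of the exponent test. `pextrw reg xmm 3` extracts the top 16 bits of the double (sign, 11 exponent bits, 4 mantissa bits), and the mask 07FF0 keeps only the exponent field. Subtracting 03870 and comparing against 0F00 then accepts biased exponents from 0x387 to 0x477, that is, magnitudes from 2^-120 up to (but not including) 2^121, while 07FF0 marks Inf/NaN. The sketch assumes the `<=` in the RosAsm macro is an unsigned comparison, and it says nothing about what the Internal routines do; the helper names `top16` and `classify` are hypothetical.

```python
import struct


def top16(value):
    """Top 16 bits of the IEEE-754 double, as 'pextrw reg xmm 3' would return."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return bits >> 48


def classify(xmm0, xmm1):
    """Name of the internal routine the dispatcher would call (assumed unsigned compares)."""
    e0 = top16(xmm0) & 0x7FF0
    e1 = top16(xmm1) & 0x7FF0
    # Emulate 32-bit wrap-around of 'sub eax 03870' followed by an unsigned '<='.
    in_range0 = ((e0 - 0x3870) & 0xFFFFFFFF) <= 0xF00
    in_range1 = ((e1 - 0x3870) & 0xFFFFFFFF) <= 0xF00
    if in_range0 and in_range1:
        return "Sse2_atan2Internal4"   # both operands have ordinary exponents
    if e0 == 0x7FF0:
        return "Sse2_atan2Internal1"   # xmm0 is Inf or NaN
    if e1 == 0x7FF0:
        return "Sse2_atan2Internal2"   # xmm1 is Inf or NaN
    return "Sse2_atan2Internal5"       # zero, denormal, very small or very large


if __name__ == "__main__":
    for a, b in [(1.0, 2.0), (float("inf"), 1.0), (1.0, float("nan")), (0.0, 1.0)]:
        print(a, b, classify(a, b))
```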

#### jj2007

Quote from: guga on February 09, 2022, 07:30:32 AM
Dll "test"
Declare Sse2_atan2, 1

The ordinal value of the function inside the dll is 1 and not 2

I changed the name of the DLL to avoid ambiguities. Declare expects the number of args, but I found out debugging your DLL that you pass the two arguments in xmm0 and xmm1 (which is pretty unorthodox).

So, here are results for Sse2_atan2(1.0, 2.0) - as compared to ucrtbase atan2:

```
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6472    cycles for 100 * Sse2_atan2
17801   cycles for 100 * ucrtbase atan2

6443    cycles for 100 * Sse2_atan2
17799   cycles for 100 * ucrtbase atan2

6424    cycles for 100 * Sse2_atan2
17810   cycles for 100 * ucrtbase atan2

6423    cycles for 100 * Sse2_atan2
17841   cycles for 100 * ucrtbase atan2

123     bytes for Sse2_atan2
127     bytes for ucrtbase atan2

Real8   -56.30993247402027180   Sse2_atan2
Real8   -56.30993247402021495   ucrtbase atan2
```
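
A quick way to read these numbers, sketched in Python with the figures copied from the output above: average the four runs to get a speed ratio, and compare both printed results with a reference. The reference here is Python's `math.atan2(-33.0, 22.0)` converted to degrees, which assumes xmm0 holds y, xmm1 holds x, and that the printed values are degrees; all three are assumptions that happen to be consistent with the -56.3 result.

```python
import math

# Cycle counts for 100 calls, copied from the benchmark output.
sse2_cycles = [6472, 6443, 6424, 6423]
ucrt_cycles = [17801, 17799, 17810, 17841]

sse2_mean = sum(sse2_cycles) / len(sse2_cycles)
ucrt_mean = sum(ucrt_cycles) / len(ucrt_cycles)
speedup = ucrt_mean / sse2_mean

# Printed results versus a reference value in degrees.
reference = math.degrees(math.atan2(-33.0, 22.0))
sse2_result = -56.30993247402027180
ucrt_result = -56.30993247402021495

if __name__ == "__main__":
    print("mean cycles:", sse2_mean, ucrt_mean)
    print("speed ratio:", speedup)
    print("Sse2_atan2 error:", abs(sse2_result - reference))
    print("ucrtbase error:  ", abs(ucrt_result - reference))
```

The two results differ only around the 14th digit after the decimal point, which is in line with the earlier remark that leaving out the error values does not cost much precision.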

The two calls are as follows:

```
  Dll "GugaAtan2"
  Declare Sse2_atan2, 0   ; 0 arguments because they are passed in xmm0 and xmm1
  Dll "ucrtbase"
  Declare atan2, C:2      ; 2 arguments, C calling convention
...
movlps xmm0, FP8(-33.0)
movlps xmm1, FP8(22.0)
void Sse2_atan2()
void atan2(FP8(-33.0), FP8(22.0))
```

See TestA and TestB for differences in the way the two algos return their results.