Atan2 SSE2

Started by guga, February 07, 2022, 08:27:40 AM


guga

Hi Guys

Has anyone managed to write an atan2 function similar to (or even better than) the ones in Intel's short vector math library?

https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-short-vector-math-library-ops/intrinsics-for-trigonometric-operations/mm-atan2-pd-mm256-atan2-pd.html
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Have you seen How to Find a Fast Floating-Point atan2 Approximation?

If your problem is to find atan2(X,Y), and either X or Y is a constant, then FastMath would be an option.

Here is one that I don't understand :sad:

guga

Hi JJ. I'll give it a try.

Also, ucrtbase.dll has an atan2 function using SSE2 (__libm_sse2_atan2), but I haven't tested it for speed yet. I wonder if there is a faster way to do it. For the normal FPU way, I wrote a routine years ago that can also convert the result to Hue angles. The problem is that it needs to be fast, and I presume the FPU way is not fast enough, especially since I plan to use it with Marinus' FFT routines for audio.


[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]

Proc atan2:
    Arguments @pY, @pX, @ConvDegree
    Structure @TempStorage 16, @TmpDataDis 0
    Uses eax, ebx, ecx

    mov ebx D@pY
    mov ecx D@pX

    fld R$ebx
    fld R$ecx
    fpatan
    fstsw ax
    wait
    shr ax 1
    jnb L2>
        fclex | stc | xor eax eax | ExitP
L2:

    .If D@ConvDegree = &TRUE
        fmul R$Float_AtanPiFactor | fst R@TmpDataDis
        Fpu_If R@TmpDataDis < R$FloatZero
            fadd R$Float360
        Fpu_Else_If R@TmpDataDis >= R$Float360
            fsub R$Float360
        Fpu_End_If
    .End_If

    clc
    mov eax &TRUE

EndP

guga

Found something interesting using polynomials

https://mazzo.li/posts/vectorized-atan2.html
https://pub.dev/documentation/complex/latest/fastmath/atan.html
https://www.dsprelated.com/showarticle/1052.php
https://opensource.apple.com/source/Libm/Libm-315/Source/Intel/atan.c
https://stackoverflow.com/questions/11930594/calculate-atan2-without-std-functions-or-c99
https://www.examplefiles.net/cs/453132
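
To give an idea of what the polynomial approach looks like, here is a minimal Python sketch. It is not taken from any of the linked pages: it assumes the well-known low-order approximation atan(t) ≈ (Pi/4)*t + 0.273*t*(1 - t) for t in [0, 1], which is only good to a few milliradians, and the function names are made up for illustration. The production versions use higher-degree polynomials, but the range reduction is the same idea.

```python
import math

def atan_unit(t):
    """Approximate atan(t) for 0 <= t <= 1 (error of a few milliradians)."""
    return (math.pi / 4.0) * t + 0.273 * t * (1.0 - t)

def fast_atan2(y, x):
    """atan2 built from atan_unit plus octant and quadrant range reduction."""
    ax, ay = abs(x), abs(y)
    if ax == 0.0 and ay == 0.0:
        return 0.0                      # same convention as math.atan2(0, 0)
    if ay > ax:
        # Steep angles: use atan(a) = Pi/2 - atan(1/a) to keep the ratio <= 1
        angle = math.pi / 2.0 - atan_unit(ax / ay)
    else:
        angle = atan_unit(ay / ax)
    if x < 0.0:
        angle = math.pi - angle         # mirror into quadrants II / III
    return -angle if y < 0.0 else angle

if __name__ == "__main__":
    for y, x in [(1.0, 2.0), (-33.0, 22.0), (0.5, -0.25)]:
        print(y, x, fast_atan2(y, x), math.atan2(y, x))
```

The division is the expensive part here; a vectorized SSE version would evaluate the same polynomial on several values at once.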

guga

Hi guys

Can someone test the timings for this, please?

I managed to rewrite the SSE2 atan2 function from ucrtbase.dll and would like to know whether it is really as fast as it seems (also compared to the original version). The function works the same way as __libm_sse2_atan2 from ucrtbase.dll. I optimized it only a little, by reorganizing the function. (The original version is kind of a mess due to heavy spaghetti code.)

I'm pretty sure it can be optimized further for speed, but I don't fully understand yet how it works.

I built it into a DLL just to make the benchmark tests easier. The function is Sse2_atan2 in test.dll. It doesn't take any parameters.

To calculate the atan2 you need to place the values of y and x (as doubles) in the registers XMM0 and XMM1 respectively (as in the original version).

In RosAsm syntax the function can be called like this:

[MyValue1: R$ 0.554545]
[MyValue2: R$ 0.1]

movupd XMM0 X$MyValue1
movupd XMM1 X$MyValue2

call Sse2_atan2

The original version (and also mine) uses a table of arctangents to calculate the atan2 of a value. The table consists of 164 pairs of Real8 values. The first value of each pair is the arctangent of a certain argument, and the second Real8 seems to be some sort of error/difference term.

For my tests, I left out this error term and used only the calculated arctangent values.

This table seems to be generated in a certain order. It consists of 164 values ranging from 0.02928849741073 to Pi/2. It runs from atan(0.029296875) at Pos 1 to atan(32) at Pos 163. The last value (Pos 164) is not computed this way; it is simply fixed at Pi/2. Going from the end towards the start (skipping the last entry), the argument decreases by a fixed step, and after every 16 values that step is halved, starting with a step of 1.

So, Pos 163 = atan(32) ---> the argument decreases by 1 on each step until Pos 147
Pos 147 = atan(16) ---> the step is halved, so the argument decreases by 0.5 on each step until Pos 131
Pos 131 = atan(8) ---> the step is halved again, so the argument decreases by 0.25 on each step until Pos 115
etc.
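
The layout described above can be sketched in a few lines of Python. This is only an illustration of the pattern (the function name is made up); it generates the first value of each pair and ignores the second one. As a side note, the second value of the last pair, 6.123233995736766e-17, is the low-order part of Pi/2 that does not fit in a double, so these second values may be low-order correction terms rather than errors.

```python
import math

def arctangent_table():
    """Rebuild the 164 'first' values: Pos 1 .. Pos 163 = atan(arg), Pos 164 = Pi/2."""
    entries = []
    value, step, since_halving = 33.0, 1.0, 0
    for _ in range(163):
        value -= step                   # 32, 31, ... 17, 16, 15.5, ...
        entries.append(math.atan(value))
        if since_halving == 16:         # after 16 steps the step size is halved
            step *= 0.5
            since_halving = 0
        since_halving += 1
    entries.reverse()                   # generated from Pos 163 down to Pos 1
    entries.append(math.pi / 2.0)       # Pos 164 is fixed at Pi/2
    return entries

if __name__ == "__main__":
    table = arctangent_table()
    print(len(table), table[0], table[162], table[163])
```

All the arguments are multiples of powers of two, so the subtraction is exact and the arguments come out as 32, 31, ..., 16, 15.5, ..., 1/32, 31/1024, 30/1024.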


I built a function that reproduces the values of the original table (except the second Real8 of each pair, the error term, which I filled with 0 for now):



[MyTangTBl: R$ 0 #(164*2)]
[Float_PI: R$ 3.141592653589793238462643383279502884197169399375105820974944592307]

Proc CreateTangetTable:
    Local @Counter, @Value, @InternalCounter, @DecreaseRate
    Uses esi

    mov D@Counter 164
    mov D@InternalCounter 0
    fld1 | fstp F@DecreaseRate

    mov esi MyTangTBl
    add esi (164*2*8) | sub esi (8*2)
    fld R$Float_PI | fmul F$Float_half | fstp R$esi ; last entry (Pos 164) must be Pi/2, so halve Pi before storing

    dec D@Counter
    sub esi (8*2)
    mov D@Value 33 | fild D@Value | fstp F@Value
    .Do

        fld D@Value | fsub F@DecreaseRate | fst F@Value | fld1 | fpatan | fstp R$esi
           
        If D@InternalCounter = 16
            fld F@DecreaseRate | fmul F$Float_half | fstp F@DecreaseRate
            mov D@InternalCounter 0
        End_If

        inc D@InternalCounter
        sub esi (8*2)
        dec D@Counter
    .Loop_Until D@Counter = 0
L1:
EndP


The original values of this table (including the error values; I didn't include them in my version, but that doesn't affect the precision too much) are:

DoubleData <0.02928849741073059, 3.878020342543118e-16>
DoubleData <0.03026419423862503, 2.13439966346682e-16>
DoubleData <0.03123983343026815, 1.27181094510696e-16>
DoubleData <0.03319093149711083, 7.623353902758895e-16>
DoubleData <0.03514177680279662, 1.622534112980482e-16>
DoubleData <0.03709235455039117, 6.433766265361664e-16>
DoubleData <0.03904264995516638, 6.181886837851646e-16>
DoubleData <0.04099264824526294, 8.420830387839661e-16>
DoubleData <0.04294233466236186, 3.142811786546985e-16>
DoubleData <0.04489169446234609, 4.096264921495231e-16>
DoubleData <0.04684071291596936, 2.912679762198781e-16>
DoubleData <0.04878937530951521, 4.053693341011583e-16>
DoubleData <0.05073766694545956, 6.612695435815108e-16>
DoubleData <0.05268557314312972, 3.289966758135281e-16>
DoubleData <0.05463307923935901, 4.691749851066725e-16>
DoubleData <0.05658017058914488, 8.289832696419427e-16>
DoubleData <0.05852683256630176, 1.362951662640667e-17>
DoubleData <0.06047305056410668, 6.37811249269131e-16>
DoubleData <0.06241880999595661, 7.339736781833367e-16>
DoubleData <0.06630889491982295, 5.407451321058335e-16>
DoubleData <0.07019697107187017, 3.451465030350394e-16>
DoubleData <0.07408292254903337, 3.609579312927062e-16>
DoubleData <0.0779666338315419, 4.082603982997626e-16>
DoubleData <0.08184798980307573, 8.205279419748868e-16>
DoubleData <0.08572687577074412, 6.992365845342258e-16>
DoubleData <0.08960317748487157, 1.793304156180366e-16>
DoubleData <0.09347678115858926, 2.018823445176747e-16>
DoubleData <0.09734757348722312, 5.576265782080219e-16>
DoubleData <0.101215441667466, 6.856928051415318e-16>
DoubleData <0.1050802734163288, 7.80192436556187e-16>
DoubleData <0.1089419569898658, 4.846007563068433e-17>
DoubleData <0.1128003812016587, 6.957566319383172e-16>
DoubleData <0.1166554354410687, 6.438661649715736e-16>
DoubleData <0.1205070096912237, 8.182569515647907e-16>
DoubleData <0.1243549945467608, 6.352529150170111e-16>
DoubleData <0.1320397616146387, 4.274189722415823e-17>
DoubleData <0.1397088742891635, 1.913310428846708e-16>
DoubleData <0.1473614810886508, 8.658324432321626e-16>
DoubleData <0.1549967419239406, 3.426523229816613e-16>
DoubleData <0.1626138285979479, 6.461627098025713e-16>
DoubleData <0.1702119252854741, 2.74014592076487e-16>
DoubleData <0.1777902289926754, 6.898598082898684e-16>
DoubleData <0.1853479499956947, 5.969184350010091e-17>
DoubleData <0.1928843122579744, 2.42385590364413e-16>
DoubleData <0.2003985538258783, 1.696734079809579e-16>
DoubleData <0.2078899272022623, 7.012225510572437e-16>
DoubleData <0.215357699697738, 5.59849672442657e-17>
DoubleData <0.2228011537593941, 3.830792364463579e-16>
DoubleData <0.2302195872768431, 6.229360680729788e-16>
DoubleData <0.2376123138654709, 3.71404797316887e-16>
DoubleData <0.2449786631268633, 8.156104484719729e-16>
DoubleData <0.2596296294082574, 1.30261057387131e-16>
DoubleData <0.2741674511196583, 4.523505634252264e-16>
DoubleData <0.2885873618940771, 3.187832078137744e-16>
DoubleData <0.3028848683749708, 6.551229868720925e-16>
DoubleData <0.3170557532091465, 5.361722230696518e-16>
DoubleData <0.3310960767041315, 5.471589019367845e-16>
DoubleData <0.345002177207105, 8.808349770693735e-17>
DoubleData <0.3587706702705722, 3.088733564861919e-17>
DoubleData <0.3723984466767538, 4.081903701236505e-16>
DoubleData <0.3858826693980735, 3.013439834812086e-16>
DoubleData <0.3992207695752521, 4.665551909062331e-16>
DoubleData <0.4124104415973866, 7.057684437286449e-16>
DoubleData <0.4254496373700416, 6.894493455169868e-16>
DoubleData <0.4383365598579569, 8.632356493938598e-16>
DoubleData <0.4510696559885234, 3.280735600183736e-17>
DoubleData <0.4636476090008053, 8.553660459218291e-16>
DoubleData <0.4883339510564051, 4.32715973660733e-16>
DoubleData <0.5123894603107368, 8.627156382272694e-16>
DoubleData <0.5358112379604636, 1.069585067790331e-16>
DoubleData <0.558599315343562, 4.38633579301471e-16>
DoubleData <0.5807563535676703, 9.660765868058498e-17>
DoubleData <0.6022873461349638, 3.62571214759831e-16>
DoubleData <0.6231993299340655, 4.708132487014636e-16>
DoubleData <0.6435011087932843, 1.268570875139599e-16>
DoubleData <0.6632029927060925, 7.463955685933131e-16>
DoubleData <0.6823165548747481, 6.943223671560008e-18>
DoubleData <0.7008544078844494, 7.572798548942514e-16>
DoubleData <0.7188299996216241, 4.226108214056057e-16>
DoubleData <0.7362574289814274, 7.008731912580885e-16>
DoubleData <0.7531512809621939, 5.308545776533963e-16>
DoubleData <0.7695264804056574, 8.51128500644098e-16>
DoubleData <0.7853981633974483, 3.061616997868383e-17>
DoubleData <0.8156919233162228, 6.554192491473065e-16>
DoubleData <0.8441539861131702, 8.397650495807761e-16>
DoubleData <0.8709034570756522, 7.544578813301367e-16>
DoubleData <0.8960553845713433, 6.95372577632837e-16>
DoubleData <0.9197196053504166, 1.814702107965036e-16>
DoubleData <0.9420000403794635, 1.656306773209825e-16>
DoubleData <0.9629943306809361, 1.070356418673049e-16>
DoubleData <0.9827937232473287, 3.46970218418778e-16>
DoubleData <1.001483135694234, 7.605168950105478e-16>
DoubleData <1.019141344266349, 6.761378336444607e-16>
DoubleData <1.0358412530088, 3.194313981784504e-17>
DoubleData <1.051650212548373, 5.696281674604188e-16>
DoubleData <1.066630365315743, 6.065679184034902e-16>
DoubleData <1.080839000541168, 2.063682824136722e-16>
DoubleData <1.09432890732119, 2.165539287700089e-16>
DoubleData <1.10714871779409, 9.404471373566379e-17>
DoubleData <1.13095374397916, 5.153275478954471e-16>
DoubleData <1.152571997215667, 1.304472198360275e-16>
DoubleData <1.172273881128476, 7.499857009153807e-16>
DoubleData <1.190289949682532, 7.683333629842069e-17>
DoubleData <1.206817370285252, 2.637692813136457e-16>
DoubleData <1.222025323210989, 6.363421861261654e-16>
DoubleData <1.236059489478081, 7.449313421696882e-16>
DoubleData <1.249045772398254, 8.859822159005129e-16>
DoubleData <1.26109338225244, 2.544660011403809e-16>
DoubleData <1.272297395208717, 2.245875015034507e-17>
DoubleData <1.28274087974427, 8.788952309458591e-16>
DoubleData <1.292496667789785, 1.537365572357647e-16>
DoubleData <1.301628834009196, 2.09675419926785e-16>
DoubleData <1.310193935047555, 7.535879521228967e-16>
DoubleData <1.318242051016837, 3.808952695386159e-16>
DoubleData <1.325817663668032, 1.3380031118552e-16>
DoubleData <1.339705659598999, 6.401436961720526e-16>
DoubleData <1.352127380920955, 2.147674250751151e-17>
DoubleData <1.363300100359694, 3.313692220777249e-16>
DoubleData <1.373400766945015, 6.330567112173988e-16>
DoubleData <1.382574821490126, 1.86429700538549e-16>
DoubleData <1.390942827002418, 7.897412983652368e-16>
DoubleData <1.398605512271957, 4.208485980241463e-16>
DoubleData <1.40564764938027, 1.328183035426864e-16>
DoubleData <1.412141064608495, 5.703957436695217e-16>
DoubleData <1.418146998399631, 8.055395818750151e-16>
DoubleData <1.423717971406494, 3.09263314147271e-16>
DoubleData <1.428899272190733, 1.574732574926438e-16>
DoubleData <1.433730152484708, 8.442163750324489e-16>
DoubleData <1.438244794498222, 6.412036156724483e-16>
DoubleData <1.442473099109101, 7.776664761570937e-16>
DoubleData <1.446441332248135, 3.141578446404818e-16>
DoubleData <1.453687582228032, 8.199907271986118e-16>
DoubleData <1.460139105621001, 2.846543520033397e-16>
DoubleData <1.465919388064663, 1.24932049384283e-16>
DoubleData <1.471127674303734, 7.79686192079511e-16>
DoubleData <1.47584462045214, 4.779648065777258e-16>
DoubleData <1.480136439594151, 3.07070859685828e-16>
DoubleData <1.484057988118911, 4.096346991714267e-16>
DoubleData <1.487655094906455, 7.844371739461077e-17>
DoubleData <1.490966341082659, 6.221434760020122e-17>
DoubleData <1.494024435525118, 8.134142446723606e-16>
DoubleData <1.496857289136956, 6.830938909876451e-16>
DoubleData <1.499488862009605, 8.873091857396741e-16>
DoubleData <1.501939837493851, 7.267528106158521e-16>
DoubleData <1.504228163019072, 4.532269913842895e-16>
DoubleData <1.506369487369343, 5.605999046418969e-16>
DoubleData <1.508377516798939, 2.154370814741562e-16>
DoubleData <1.512040504079174, 1.402618800554007e-16>
DoubleData <1.515297821549179, 5.368157936821226e-16>
DoubleData <1.518213265183954, 7.375391359310956e-16>
DoubleData <1.520837931072953, 6.825613612540716e-16>
DoubleData <1.523213223517913, 4.501543596256141e-16>
DoubleData <1.52537304737332, 2.482983385700394e-17>
DoubleData <1.527345431403365, 7.934779986221159e-16>
DoubleData <1.529153747696308, 6.760940733854243e-16>
DoubleData <1.530817639671606, 3.549557335150754e-16>
DoubleData <1.532353736773708, 2.956322283362237e-16>
DoubleData <1.533776210920966, 9.377354806572844e-17>
DoubleData <1.535097214115572, 7.767503702335957e-16>
DoubleData <1.536327225795388, 8.864446875291863e-16>
DoubleData <1.537475330916649, 8.116858600761242e-17>
DoubleData <1.538549444359642, 5.614707476320801e-16>
DoubleData <1.539556493364628, 8.22229665146797e-16>
DoubleData <1.570796326794897, 6.123233995736766e-17>


The source is embedded in the DLL, but I can post it here later if the tests are OK. It shouldn't be hard to port to MASM, if someone is interested in working further on the function.

Like the original version, it can only calculate one value (a double) at a time. But I presume it could be optimized to handle 4 floats at once to gain even more speed.

guga

Btw, to convert the result to degrees you can also use this:



[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]
[Float_Zero: R$ 0]

Proc SSE_Atan2Double:
    Arguments @pY, @pX, @ConvDegree
    Uses eax, ebx, ecx

    mov ebx D@pY
    mov ecx D@pX

    movupd XMM0 X$ebx
    movupd XMM1 X$ecx
    ;call 'ucrtbase.__libm_sse2_atan2' <---original version inside ucrtbase.dll
    call Sse2_atan2 ; <--- my test version

    .If D@ConvDegree = &TRUE ; <--- Flag to convert to degree (True/False)
        movupd XMM1 X$Float_AtanPiFactor
        mulpd XMM0 XMM1
        xorpd xmm2 xmm2
        SSE_D_If xmm0 < xmm2 ; A simple macro to handle sse2 comparisons. It is simply: COMISD XMM0 XMM2 | JNB D0> So, it is comisd followed by the corresponding jmp to D0 (Our Else_IF macro)
            addpd xmm0 X$Float360
        SSE_D_Else_If xmm0 > X$Float360 ; same as above:     JMP D1> | D0:COMISD XMM0 X$FLOAT360 | JNA D0>
            subpd xmm0 X$Float360
        SSE_D_End_If ; The ending of the jumps D0: D1:
    .End_If

    mov eax &TRUE

EndP


Example of usage:

[TestingAtan2Data1: R$ 0.5317094316614787480759158718400589803054643151426570508592559650]
[TestingAtan2Data2: R$ 1]

call SSE_Atan2Double TestingAtan2Data1, TestingAtan2Data2, &TRUE

It will return 28° in XMM0.
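
For reference, the degree conversion and wrap-around done by the routine above can be checked against a plain Python sketch. It uses math.atan2 instead of Sse2_atan2, and the function name is made up for illustration.

```python
import math

def atan2_degrees(y, x):
    """atan2 in degrees, wrapped into the 0..360 range like the routine above."""
    degrees = math.atan2(y, x) * (180.0 / math.pi)
    if degrees < 0.0:
        degrees += 360.0                # negative angles wrap around
    elif degrees >= 360.0:
        degrees -= 360.0
    return degrees

if __name__ == "__main__":
    # Same inputs as the usage example: y = tan(28 degrees), x = 1
    print(atan2_degrees(0.5317094316614787, 1.0))
    print(atan2_degrees(-1.0, 0.0))
```

With the inputs from the usage example this gives 28 degrees up to rounding, and a negative y ends up in the 180 to 360 range instead of returning a negative angle.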

jj2007

Quote from: guga on February 08, 2022, 11:10:30 PM
Can someone test the timings for this, please?

It's a DLL, Gustavo... how are we supposed to test it?

guga

The same way as you did for FastMath. Didn't you use an external DLL to call the function? The benchmark apps can handle external DLLs, right?

something like

importlib test.lib

invoke Sse2_atan2

But I can try to disassemble it and create a MASM-compatible source, if needed.

guga

Please see if this helps (it's the disassembled source).

jj2007

Quote from: guga on February 09, 2022, 05:13:51 AM
The same way as you did for FastMath. Didn't you use an external DLL to call the function? The benchmark apps can handle external DLLs, right?

something like

importlib test.lib

invoke Sse2_atan2

It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
  Dll "GugaAtan2"
  Declare Sse2_atan2, 2


fails miserably saying "can't find module Sse2_atan2" :sad:

HSE

Working here! (using source code)
No timing yet

erased a minor detail :biggrin:
Equations in Assembly: SmplMath

guga

Quote from: jj2007 on February 09, 2022, 06:04:39 AM
Quote from: guga on February 09, 2022, 05:13:51 AM
The same way as you did for FastMath. Didn't you use an external DLL to call the function? The benchmark apps can handle external DLLs, right?

something like

importlib test.lib

invoke Sse2_atan2

It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
  Dll "GugaAtan2"
  Declare Sse2_atan2, 2


fails miserably saying "can't find module Sse2_atan2" :sad:

That's weird. I named the exported function "Sse2_atan2". Are you sure you are importing test.dll? The name of the DLL is not "GugaAtan2".

Shouldn't it be this?

Dll "test"
Declare Sse2_atan2, 1

The ordinal value of the function inside the dll is 1 and not 2

guga

Quote from: HSE on February 09, 2022, 07:23:02 AM
Working here!
No timing yet

erased a minor detail :biggrin:

Tks HSE  :thumbsup: :thumbsup: :thumbsup:

guga

Quote from: jj2007 on February 09, 2022, 06:04:39 AM
Quote from: guga on February 09, 2022, 05:13:51 AM
The same way as you did for FastMath. Didn't you use an external DLL to call the function? The benchmark apps can handle external DLLs, right?

something like

importlib test.lib

invoke Sse2_atan2

It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
  Dll "GugaAtan2"
  Declare Sse2_atan2, 2


fails miserably saying "can't find module Sse2_atan2" :sad:

Sorry, I forgot to answer this:

"Sse2_atan2(double X, double Y)?"

Yes, the contents of xmm0 and xmm1 are doubles (the same as in the original library, ucrtbase.dll). The function does not take any parameters, but you must load the xmm0 and xmm1 registers with double values.

Internally I rewrote the function like this:


; The table was generated as CreateTangetTable
; I rewrote/reorganized the whole function to avoid heavy usage of spaghetti code as in the original version inside ucrtbase.dll

Proc Sse2_atan2::

    movlpd X$SSEAtanTmpVal1 xmm0 ; <-----Temporary global variable to store the data to be used in the internal functions
    movlpd X$SSEAtanTmpVal2 xmm1 ; <-----Temporary global variable to store the data to be used in the internal functions

    pextrw eax xmm0 3 | and eax 07FF0 | sub eax 03870
    pextrw ecx xmm1 3 | and ecx 07FF0 | sub ecx 03870

    .If_And eax <= 0F00, ecx <= 0F00
        call Sse2_atan2Internal4
    .Else
        pextrw ecx xmm0 3 | and ecx 07FF0
        pextrw eax xmm1 3 | and eax 07FF0
        If ecx = 07FF0
            call Sse2_atan2Internal1
        Else_If eax = 07FF0
            call Sse2_atan2Internal2
        Else
            call Sse2_atan2Internal5
        End_If
    .End_If

EndP
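
The dispatch at the top of the routine can be mimicked in Python to see which inputs take which path. This is a sketch under two assumptions: that the RosAsm `<=` comparison is unsigned, and that the path labels below (which just reuse the routine names from the listing) describe cases whose internals are not shown here. pextrw with index 3 takes bits 48..63 of the low double, and the 07FF0 mask keeps the 11 exponent bits. 03870 corresponds to a biased exponent of 903 (2^-120) and 0F00 spans 240 exponent steps, so the first branch is taken when both magnitudes lie roughly between 2^-120 and 2^121. An exponent field of 07FF0 means Inf or NaN.

```python
import struct

def exponent_field(value):
    """Bits 48..63 of the double, masked to the exponent (pextrw ..., 3 then and 0x7FF0)."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return (bits >> 48) & 0x7FF0

def dispatch(first, second):
    """Return the name of the internal routine the dispatcher would call."""
    e_first, e_second = exponent_field(first), exponent_field(second)
    # Emulate 32-bit unsigned arithmetic: a small exponent wraps to a huge value
    in_range_first = ((e_first - 0x3870) & 0xFFFFFFFF) <= 0x0F00
    in_range_second = ((e_second - 0x3870) & 0xFFFFFFFF) <= 0x0F00
    if in_range_first and in_range_second:
        return "Sse2_atan2Internal4"    # both operands in the normal range
    if e_first == 0x7FF0:
        return "Sse2_atan2Internal1"    # xmm0 operand is Inf or NaN
    if e_second == 0x7FF0:
        return "Sse2_atan2Internal2"    # xmm1 operand is Inf or NaN
    return "Sse2_atan2Internal5"        # zero, denormal, very small or very large

if __name__ == "__main__":
    print(dispatch(-33.0, 22.0))
    print(dispatch(float("inf"), 1.0))
    print(dispatch(0.0, 1.0))
```

The sign bit is masked away, so negative inputs follow the same path as their absolute values.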

jj2007

Quote from: guga on February 09, 2022, 07:30:32 AM
Dll "test"
Declare Sse2_atan2, 1

The ordinal value of the function inside the dll is 1 and not 2

I changed the name of the DLL to avoid ambiguities. Declare expects the number of args, but I found out debugging your DLL that you pass the two arguments in xmm0 and xmm1 (which is pretty unorthodox).

So, here are results for Sse2_atan2(-33.0, 22.0), as compared to ucrtbase atan2:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6472    cycles for 100 * Sse2_atan2
17801   cycles for 100 * ucrtbase atan2

6443    cycles for 100 * Sse2_atan2
17799   cycles for 100 * ucrtbase atan2

6424    cycles for 100 * Sse2_atan2
17810   cycles for 100 * ucrtbase atan2

6423    cycles for 100 * Sse2_atan2
17841   cycles for 100 * ucrtbase atan2

123     bytes for Sse2_atan2
127     bytes for ucrtbase atan2

Real8   -56.30993247402027180   Sse2_atan2
Real8   -56.30993247402021495   ucrtbase atan2
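
To put the numbers above into perspective, here is a small Python sketch that compares the two printed results with math.atan2 for the inputs -33.0 and 22.0 (the values loaded in the calls shown in this post) and computes the cycle ratio from the first timing run. The variable and function names are made up for illustration.

```python
import math

# Reference value in degrees for y = -33.0, x = 22.0
reference = math.degrees(math.atan2(-33.0, 22.0))

sse2_result = -56.30993247402027180     # printed for Sse2_atan2
ucrt_result = -56.30993247402021495     # printed for ucrtbase atan2

def speedup(slow_cycles, fast_cycles):
    """How many times faster the routine with fewer cycles is."""
    return slow_cycles / fast_cycles

if __name__ == "__main__":
    print("reference        ", reference)
    print("Sse2_atan2 error ", abs(sse2_result - reference))
    print("ucrtbase error   ", abs(ucrt_result - reference))
    print("speedup          ", speedup(17801, 6472))
```

The two results agree to about 13 decimal places in degrees, and the difference is probably due to the zeroed error terms in the table. The cycle counts give a ratio of roughly 2.75 in favour of Sse2_atan2.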


The two calls are as follows:

  Dll "GugaAtan2"
  Declare Sse2_atan2, 0   ; 0 arguments because they are passed in xmm0 and xmm1
  Dll "ucrtbase"
  Declare atan2, C:2      ; 2 arguments, C calling convention
...
movlps xmm0, FP8(-33.0)
movlps xmm1, FP8(22.0)
void Sse2_atan2()

void atan2(FP8(-33.0), FP8(22.0))


See TestA and TestB for differences in the way the two algos return their results.