News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Atan2 SSE2

Started by guga, February 07, 2022, 08:27:40 AM

Previous topic - Next topic

guga

Hi JJ. Many tks. It seems pretty fast :azn: :azn: :azn:

I know it is an  unorthodox way, but i just wanted to test the functionality on the exact way as it is in the original dll (ucrtbase). To use regular parameters, i created another function to test like this:



[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]
[Float_Zero: R$ 0]


pY(Input) = Pointer to a Double (Real8) value
pX(input) = Pointer to a Double (Real8) value
ConvDegree - A Flag to convert the output to degree or radian. True =  convert to degree. Radian, otherwise

Proc SSE_AtanEx:
    Arguments @pY, @pX, @ConvDegree
    Uses eax, ebx, ecx, edx

    mov ebx D@pY
    mov ecx D@pX

    movupd XMM0 X$ebx
    movupd XMM1 X$ecx
    call  Sse2_atan2  ;<------ call the main function

    .If D@ConvDegree = &TRUE
        movupd XMM1 X$Float_AtanPiFactor
        mulpd XMM0 XMM1
        movupd XMM1 X$Float360
        xorpd xmm2 xmm2
        SSE_D_If xmm0 < xmm2  ; Simple SSE comparition macro as explained earlier. COMISD XMM0 XMM2 | jnb D0> ......
            addpd xmm0 xmm1
        SSE_D_Else_If xmm0 > xmm1 ; Same as above
            subpd xmm0 xmm1
        SSE_D_End_If ; endpoint of the comparition macros. Those macros works the same as the regular If/Else/Else_If, but focused to SSE registers.
    .End_If

    mov eax &TRUE

EndP


Btw..later both functions can be joined to allow the usage of parameters on a single function, but i wanted to test the speed and accuracy of this routine 1st.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Quote from: jj2007 on February 09, 2022, 08:10:55 AM
Quote from: guga on February 09, 2022, 07:30:32 AMDll "test"
Declare Sse2_atan2, 1

The ordinal value of the function inside the dll is 1 and not 2

I changed the name of the DLL to avoid ambiguities. Declare expects the number of args, but I found out debugging your DLL that you pass the two arguments in xmm0 and xmm1 (which is pretty unorthodox).

So, here are results for Sse2_atan2(1.0, 2.0) - as compared to ucrtbase atan2:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6405    cycles for 100 * Sse2_atan2
35327   cycles for 100 * ucrtbase atan2

6407    cycles for 100 * Sse2_atan2
35314   cycles for 100 * ucrtbase atan2

6420    cycles for 100 * Sse2_atan2
35373   cycles for 100 * ucrtbase atan2

6426    cycles for 100 * Sse2_atan2
35292   cycles for 100 * ucrtbase atan2

105     bytes for Sse2_atan2
108     bytes for ucrtbase atan2

Real8   0.4636476090008099793   Sse2_atan2
Real8   1.570796326794896558    ucrtbase atan2


Results are different, so I probably got some arguments wrong :sad:

About the results being different, maybe it is reusing xmm0 and xmm1 when the functions are called.

Here it works ok if i only do this:


[TestingAtan2Data1: R$ 1.0]
[TestingAtan2Data2: R$ 2.0]

    movupd XMM0 X$TestingAtan2Data1
    movupd XMM1 X$TestingAtan2Data2
    call 'ucrtbase.__libm_sse2_atan2'


    movupd XMM0 X$TestingAtan2Data1
    movupd XMM1 X$TestingAtan2Data2
    call Sse2_atan2


xmm0 = 4.63647609000806093e-1 (ucrtbase)
xmm0 = 4.63647609000809979e-1 (mine version).

The small difference is due to the rounding error that i removed while i was testing the functionality. If you use the complete tangent table, then it should generate the same result. Since, i´m still testing it, i´m unsure how this whole thing works exactly.

I made some tests and placed the results of the tangent table on excel and it seems to be a sort of rotating table that the function is using. I´m trying to find if there is a direct relation between the table values and the resultant data, to see if the speed can be improved further, simply making pointers to the table and some small calculations. But, i still have no idea exactly how this function really works.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

HSE

Impressive  :thumbsup:

Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)

6783    cycles for 100 * Sse2_atan2
11995   cycles for 100 * fpatan

6783    cycles for 100 * Sse2_atan2
11999   cycles for 100 * fpatan

6798    cycles for 100 * Sse2_atan2
12005   cycles for 100 * fpatan

6786    cycles for 100 * Sse2_atan2
11989   cycles for 100 * fpatan

29      bytes for Sse2_atan2
20      bytes for fpatan

Real8   0.4636476090008099793   Sse2_atan2
Real8   0.4636476090008060935   fpatan


This include result storage.
Equations in Assembly: SmplMath

guga

Tks HSE :azn: :azn: :azn:

Sorry for the syntax. I´m used to RosAsm that´s why i have some difficulties in port the code, but it seems to work as expected with minor differences then the original version. You can rewrite the code as you wish.I believe there´s a way to make it even faster, but it is necessary to understand what exactly the whole function is doing. It uses a table of tangents created on a certain way, displaced on 164 values, with a separation of 16 "indexes". I don´t know why it is using only 164 indexes rather then the 176  = 11 indexes *16 (closer to 180º). In excel there seems to be a sort of relation between the indexes and the resultants value, but i´m not sure yet.

Once the whole function is fully understand then, perhaps optimization could be done, forcing the function to stay in only 1 quadrant (90º) and identifying the changes of sign in other quadrants and simply add or subtract 360º from the result. Don´t know yet.

There are other functions inside ucrtbase that seems really good, but the problems is that the way they were written kills speed due to heavy spaghetti code. On the original libm_sse2_atan2, it contains jumps that go back and forward the entire function, and also some pointers to other frames (other functions) as well. What i did was reorganize the whole code and create the proper sub-functions avoiding the heavy usage of spaghetti code inside.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Hi Guga,

It seems you had an earlier version (I edited the post). Here is the latest, with HSE's fpatan addition:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6923    cycles for 100 * Sse2_atan2
18360   cycles for 100 * ucrtbase atan2
11834   cycles for 100 * fpatan

6434    cycles for 100 * Sse2_atan2
18435   cycles for 100 * ucrtbase atan2
11476   cycles for 100 * fpatan

6430    cycles for 100 * Sse2_atan2
17775   cycles for 100 * ucrtbase atan2
11868   cycles for 100 * fpatan

6429    cycles for 100 * Sse2_atan2
17821   cycles for 100 * ucrtbase atan2
11493   cycles for 100 * fpatan

123     bytes for Sse2_atan2
127     bytes for ucrtbase atan2
110     bytes for fpatan

Real8   -56.30993247402027180   Sse2_atan2
Real8   -56.30993247402021495   ucrtbase atan2
Real8   -56.30993247402021495   fpatan

guga

Tks a lot, JJ

I´ll take a look. There is a small difference in the results, but it´s due to the rounding error that i removed earlier.

I´m trying to understand the math behind this, and why the function is using things like:
pextrw eax xmm0 3 | and eax 07FF0 | sub eax 03870
...

mov eax 0800 | pinsrw xmm5 eax 3

etc etc.

before it reaches what is really necessary to compute the atan2, i mean, before it do things like divsd xmm4 xmm1 etc.

If i could understand what´s going on before it reaches the SSe division, then maybe we could gain more speed, simply making the division of xmm0 / xmm1 at 1st and then do minor calculations before it reaches the tangent table to later adjust the remainder value.

For me, it´s a bit weird it´s using an error value in the 2nd Real8 on the tangent table, because the values of the 1st and 2nd Real8 are complementary, so, it seems useless calculating all of this feeding error values. What´s also weird is that the original version from ucrtbase and fpatan produces some smaller differences then the values found in wolfram-alpha, for example.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Btw...follow the results i´ve got in your last version

:dazzled: :dazzled: :dazzled: :dazzled:

AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

6931    cycles for 100 * Sse2_atan2
69442   cycles for 100 * ucrtbase atan2
14871   cycles for 100 * fpatan

7411    cycles for 100 * Sse2_atan2
69354   cycles for 100 * ucrtbase atan2
14886   cycles for 100 * fpatan

6994    cycles for 100 * Sse2_atan2
68109   cycles for 100 * ucrtbase atan2
14957   cycles for 100 * fpatan

7040    cycles for 100 * Sse2_atan2
67606   cycles for 100 * ucrtbase atan2
14731   cycles for 100 * fpatan

123     bytes for Sse2_atan2
127     bytes for ucrtbase atan2
110     bytes for fpatan

Real8   -56.30993247402027180   Sse2_atan2
Real8   -56.30993247402021495   ucrtbase atan2
Real8   -56.30993247402021495   fpatan

--- ok ---


Now i can try to make a small test on the routine i was doing for marinu´s Fir Algo and FFT Analyser to retrieve the value of phase of FFT Frequencies :)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

HSE

Quote from: guga on February 09, 2022, 09:42:00 AM
then maybe we could gain more speed

The reason that SSE algorithm is fast than FPU is because have less precision. While FPU use 8-9 degree polinomies, SSE algorithms are using 5-6 degrees (if I remeber well some discussion somewhere in this forum).

If you can accept less precision perhaps You can use a faster algorithm.

Then perhaps You can build a hyperfaster low precision library for previews, and a best precision library for final works.

Equations in Assembly: SmplMath

guga

I don´t know if the precision is less accurate. On the original version It seems to produce the same result as in FPU. I finished some minor optimizations both for speed and accuracy. Now it seems to result the same value as the original one, and perhaps a bit more fast.

The function (and the original) uses an approximation that seems to be biased in:
https://www.dsprelated.com/showarticle/1052.php

But, it works better, since it is more accurate then the results of the link. It is using a rotating table of tangents. On my last version i removed several useless SSE checks, including  the ones at the start of the function, replaced them with the regular opcodes counterpart (mov eax etc) and forced the routines to simply point to the proper masks. I´m currently trying to understand how exactly one of the internal function works, "Sse2_atan2Internal4", because this is where the table of tangents are being used. It gets a sort of index to calculate the value, and maybe we can create a similar binary search function as JJ did sometime ago to point to the indexes of the table or something like that.

I´m currently labeling the variables and forcing the function to use the proper parameters rather then xmm0/xmm1 directly and also creating another function to retrieve the resul in Angles too. Once i finish i´ll upload it here to we perform the benchmarks. :smiley: :smiley:
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Quote from: guga on February 10, 2022, 02:09:47 AM...table of tangents are being used. It gets a sort of index to calculate the value, and maybe we can create a similar binary search function as JJ did sometime ago to point to the indexes of the table or something like that.

Quote from: jj2007 on February 07, 2022, 09:54:42 AMIf your problem is to find atan2(X,Y), and either X or Y is a constant, then FastMath would be an option.

guga

#25
Tks, JJ.

I couldn´t understand how to test FastMath. Also, isn´t it using the windows dll  (crt or ucrtbase) underneath ?

Btw...here goes a new version. I made some modifications as you suggested and make it work with regular parameters. One thing to consider when assembling this. The global variables Atan2_Mask1 to Atan2_Mask12 must be aligned to a 16 byte-boundary, otherwise the app will crash. This is because they are being used as variables to things like minsd, andpd etc

I added 4 functions (Renamed the previous function . "D" for double for easier comprehension):

SSE_Atan2DEx: Retrieves the atan2 of a value and also can convert the result to Degree or Radian.
Uses 3 parameters:
pY - Pointer to a Real8 value
pX - Pointer to a Real8 value
ConvDegree- A Flag (integer) to convert the result to degrees. TRUE = Degrees. FALSE = Radians.

Return Value: The resultant value is located in xmm0 register


SSE_Atan2D: Retrieves the atan2 of a value.
Uses 2 parameters:
pY - Pointer to a Real8 value
pX - Pointer to a Real8 value

Return Value: The resultant value is located in xmm0 register (In radians)

Atan2: Retrieves the atan2 of a value and also can convert the result to Degree or Radian. Same as SSE_Atan2DEx, but using FPU rather then SSE2
Uses 3 parameters:
pY - Pointer to a Real8 value
pX - Pointer to a Real8 value
ConvDegree- A Flag (integer) to convert the result to degrees. TRUE = Degrees. FALSE = Radians.

Return Value: The resultant value is located in ST0 register

CreateTangetTable This was for my tests only. It generates a table of tangents used in SSE_Atan2DEx function. The function uses no parameters, and the resultant values are stored in MyTangTBl variable
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Typo:

The SSE_Atan2D uses 2 parameters and not 3 as i wrote in the previous post. Post now edited :smiley:
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Quote from: guga on February 10, 2022, 09:47:29 AM
I couldn´t understand how to test FastMath. Also, isn´t it using the windows dll  (crt or ucrtbase) underneath ?

No, it doesn't use crt or similar.

Quote from: jj2007 on February 07, 2022, 09:54:42 AMIf your problem is to find atan2(X,Y), and either X or Y is a constant, then FastMath would be an option.

But FastMath is limited to one argument functions, that's why I asked what your usage of that function is. Anyway, I attach a new version that includes FastMath with a fixed second argument:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6699    cycles for 100 * Sse2_atan2
18672   cycles for 100 * ucrtbase atan2
11874   cycles for 100 * fpatan
1568    cycles for 100 * FastAtan

6511    cycles for 100 * Sse2_atan2
18258   cycles for 100 * ucrtbase atan2
11638   cycles for 100 * fpatan
1426    cycles for 100 * FastAtan

6581    cycles for 100 * Sse2_atan2
18033   cycles for 100 * ucrtbase atan2
11706   cycles for 100 * fpatan
1477    cycles for 100 * FastAtan

6574    cycles for 100 * Sse2_atan2
18159   cycles for 100 * ucrtbase atan2
11986   cycles for 100 * fpatan
1450    cycles for 100 * FastAtan

123     bytes for Sse2_atan2
127     bytes for ucrtbase atan2
106     bytes for fpatan
269     bytes for FastAtan

Real8   -56.30993247402027180   Sse2_atan2
Real8   -56.30993247402021495   ucrtbase atan2
Real8   -56.30993247402021495   fpatan
Real8   -56.30993247402021495   FastAtan


Here is the FastMath usage, first the creation (once):
  FastMath FastAtan ; define a math function
For_ fct=-40.0 To 10.0 Step 0.5 ; max 10,000 iterations
fld fct ; X
fstp REAL10 ptr [edi]
fld fct
fld FP8(22.0)
fpatan    ; Masm32 algo
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
  FastMath


And here the usage in the loop, see TestD:

  mov ebx, AlgoLoops-1 ; loop e.g. 100x
  align 4
  .Repeat
void FastAtan(-33.0)
dec ebx
.if !Sign?
fstp st
.endif
  .Until Sign?


Timings for the creation of the FastMath table (it could be done many times in an outer loop for the second argument):

39 µs for creating FastMath
28 µs for creating FastMath
23 µs for creating FastMath
26 µs for creating FastMath
23 µs for creating FastMath
25 µs for creating FastMath
23 µs for creating FastMath
25 µs for creating FastMath
23 µs for creating FastMath
23 µs for creating FastMath

guga

WOW !

Tks JJ. But, i think you used the old version. The new one (That uses parameters) is the one attached previously (also attached here). The name of the function is SSE_Atan2DEx (Uses 3 parameters to convert to degree) or SSE_Atan2D (Using only 2 parameters. No degree conversion)

I tried to assemble your file with RichMasm, but got this msg

*** Start C:\masm32\MasmBasic\Res\bldallRM.bat ***
*** 32-bit assembly ***


*** Assemble, link and run GugaAtan2 ***

*** Assemble using ml  ***
Assembling: tmp_file.asm
##########################

You CANNOT use the MasmBasic library with
the old ml.exe version 6.14 - use UAsm instead

##########################
tmp_file.asm(13) : error A2008: syntax error : StackWalk
*** MasmBasic version 18.02.2021 ***
*** Warning: SQWORD is unsigned with ml **
\masm32\MasmBasic\MasmBasic.inc(728) : error A2052: forced error
TestMasmVersion(8): Macro Called From
  \masm32\MasmBasic\MasmBasic.inc(728): Include File
*** Assembly error ***



I do have uasm installed in the masm directory "C:\masm32\bin\uasm64.exe". How do i configure richmasm to assemble the file using uasm, so i can try to assemble the file as well ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

#29
Quote from: guga on February 11, 2022, 02:00:17 AMI tried to assemble your file with RichMasm, but got this msg

Oops, my fault:
- comment out StackWalk in line 13 (that's a new macro, not yet included in the current MasmBasic version)
- use OPxT_Assembler at the end of the file (the x disables the option; your ml.exe is the old 6.14)

New version below. Creating the FastMath table for a new second argument costs 12 microseconds on average. You still have not told us how you will use this; since you are interested in speed, I suppose you have an innermost plus some outer loop. Can the outer loop work with a 12 microseconds delay?

12 us on average for creating FastMath
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6417    cycles for 100 * SSE_Atan2D_new
18184   cycles for 100 * ucrtbase atan2
11387   cycles for 100 * fpatan
1438    cycles for 100 * FastAtan

6273    cycles for 100 * SSE_Atan2D_new
18093   cycles for 100 * ucrtbase atan2
11382   cycles for 100 * fpatan
1398    cycles for 100 * FastAtan

6171    cycles for 100 * SSE_Atan2D_new
18259   cycles for 100 * ucrtbase atan2
11396   cycles for 100 * fpatan
1398    cycles for 100 * FastAtan

6378    cycles for 100 * SSE_Atan2D_new
18089   cycles for 100 * ucrtbase atan2
11382   cycles for 100 * fpatan
1398    cycles for 100 * FastAtan

49      bytes for SSE_Atan2D_new
51      bytes for ucrtbase atan2
30      bytes for fpatan
193     bytes for FastAtan

Real8   -56.30993247402021495   SSE_Atan2D_new
Real8   -56.30993247402021495   ucrtbase atan2
Real8   -56.30993247402021495   fpatan
Real8   -56.30993247402021495   FastAtan