Hi Guys
Someone succeeded to create a atan2 function similar (or even better) to the ones existent in Intel ?
https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-short-vector-math-library-ops/intrinsics-for-trigonometric-operations/mm-atan2-pd-mm256-atan2-pd.html
Have you seen How to Find a Fast Floating-Point atan2 Approximation (https://www.dsprelated.com/showarticle/1052.php)?
If your problem is to find atan2(X,Y), and either X or Y is a constant, then FastMath (http://masm32.com/board/index.php?topic=8779.0) would be an option.
Here is one that I don't understand (https://gist.github.com/tobin/2028637) :sad:
Hi JJ. I´ll give a try.
Also, ucrtbase.dll have a atan2 function using SSE2 ( ___libm_sse2_atan2), but i didn´t tested yet for speed. I wonder if there is a faster way to do it. For the normal way using FPU, i did some routine years ago that can also convert to Hue angles. The problem is that it needs to be fst, and i presume the FPU way is not fast enough, specially when i plan to use it with Marinus FFT routines for audio.
[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]
Proc atan2:
Arguments @pY, @pX, @ConvDegree
Structure @TempStorage 16, @TmpDataDis 0
Uses eax, ebx, ecx
mov ebx D@pY
mov ecx D@pX
fld R$ebx
fld R$ecx
fpatan
fstsw ax
wait
shr ax 1
jnb L2>
fclex | stc | xor eax eax | ExitP
L2:
.If D@ConvDegree = &TRUE
fmul R$Float_AtanPiFactor | fst R@TmpDataDis
Fpu_If R@TmpDataDis < R$FloatZero
fadd R$Float360
Fpu_Else_If R@TmpDataDis >= R$Float360
fsub R$Float360
Fpu_End_If
.End_If
clc
mov eax &TRUE
EndP
Found something interesting using polynomials
https://mazzo.li/posts/vectorized-atan2.html
https://pub.dev/documentation/complex/latest/fastmath/atan.html
https://www.dsprelated.com/showarticle/1052.php
https://opensource.apple.com/source/Libm/Libm-315/Source/Intel/atan.c
https://stackoverflow.com/questions/11930594/calculate-atan2-without-std-functions-or-c99
https://www.examplefiles.net/cs/453132
Hi guys
Can someone test the timmings for this please ?
I succeeded to rewrite a atan2 functions using SSE2 from ucrtbase.dll and wanted to know if it is really that fast as it seems (also compared to the original version as well). The function works the same as in __libm_sse2_atan2 from ucrtbase.dll. I optimized it a little bit only reorganizing the function. (The original version is a kind of a mess due to heavy spaghetti code)
I´m pretty sure it can be optimized further for speed, but i didn´t fully understood yet, how it works.
I built it on a dll just to make easier to the benchmark tests. The function is Sse2_atan2 in the test.dll. It doesn't contains any parameters.
To calculate the atan2 you need to place the values of x, y (in double) in the registers XMM0 and XMM1 (as it is on the original version).
In Rosasm syntax the function can be called as:
[MyValue1: R$ 0.554545]
[MyValue2: R$ 0.1]
movupd XMM0 X$MyValue1
movupd XMM1 X$MyValue2
call Sse2_atan2
The original version (and also mine) uses a table of tangents to calculate the atan2 of a value. The table consists of 164 pairs of Real8 values. The 1st one representing the tangent of a certain value and the next Real8 seems to be some error /difference of some sort.
For my tests, i excluded this error case and used only the true calculated values of the tangents.
This table seems to be generated on a certain order. It consists of 164 values whose last starting from 0.02928849741073 to Pi/2. So it starts with the tangent of a certain value: Value at Pos1 = atan(0.029296875) to atan(32) (Value at Pos 163). And the last value (Pos164) since it cannot be atan(pi/2), it restricted to Pi/2. On each 16 values (from the last to the 1st - excepting the last one), it decreaes the value to be calculate by 1/2, starting with 1.
So, at Pos 163 = atan(32) ---> decreasing 1 on each loop until Pos 147
Pos 147 = atan(16) ---> decreasing the half of 1. So, decreasing 0.5 on the next loop untill Pos131
Pos131 = atan(8) ---> decreasing the half of 0.5. So, decreasing 0.25 on the next loop untill Pos115
etc
I built a function that reproduces the values of the original table (except the 2nd Real8 for error cases, which i filled only with 0 for now):
[MyTangTBl: R$ 0 #(164*2)]
[Float_PI: R$ 3.141592653589793238462643383279502884197169399375105820974944592307]
Proc CreateTangetTable:
Local @Counter, @Value, @InternalCounter, @DecreaseRate
Uses esi
mov D@Counter 164
mov D@InternalCounter 0
fld1 | fstp F@DecreaseRate
mov esi MyTangTBl
add esi (164*2*8) | sub esi (8*2)
fld R$Float_PI | fstp R$esi
dec D@Counter
sub esi (8*2)
mov D@Value 33 | fild D@Value | fstp F@Value
.Do
fld D@Value | fsub F@DecreaseRate | fst F@Value | fld1 | fpatan | fstp R$esi
If D@InternalCounter = 16
fld F@DecreaseRate | fmul F$Float_half | fstp F@DecreaseRate
mov D@InternalCounter 0
End_If
inc D@InternalCounter
sub esi (8*2)
dec D@Counter
.Loop_Until D@Counter = 0
L1:
EndP
The original values of this table (including the error values - I didn´t included the error values on mine version, btw, but it don´t affect the precision too much) is :
DoubleData <0.02928849741073059, 3.878020342543118e-16>
DoubleData <0.03026419423862503, 2.13439966346682e-16>
DoubleData <0.03123983343026815, 1.27181094510696e-16>
DoubleData <0.03319093149711083, 7.623353902758895e-16>
DoubleData <0.03514177680279662, 1.622534112980482e-16>
DoubleData <0.03709235455039117, 6.433766265361664e-16>
DoubleData <0.03904264995516638, 6.181886837851646e-16>
DoubleData <0.04099264824526294, 8.420830387839661e-16>
DoubleData <0.04294233466236186, 3.142811786546985e-16>
DoubleData <0.04489169446234609, 4.096264921495231e-16>
DoubleData <0.04684071291596936, 2.912679762198781e-16>
DoubleData <0.04878937530951521, 4.053693341011583e-16>
DoubleData <0.05073766694545956, 6.612695435815108e-16>
DoubleData <0.05268557314312972, 3.289966758135281e-16>
DoubleData <0.05463307923935901, 4.691749851066725e-16>
DoubleData <0.05658017058914488, 8.289832696419427e-16>
DoubleData <0.05852683256630176, 1.362951662640667e-17>
DoubleData <0.06047305056410668, 6.37811249269131e-16>
DoubleData <0.06241880999595661, 7.339736781833367e-16>
DoubleData <0.06630889491982295, 5.407451321058335e-16>
DoubleData <0.07019697107187017, 3.451465030350394e-16>
DoubleData <0.07408292254903337, 3.609579312927062e-16>
DoubleData <0.0779666338315419, 4.082603982997626e-16>
DoubleData <0.08184798980307573, 8.205279419748868e-16>
DoubleData <0.08572687577074412, 6.992365845342258e-16>
DoubleData <0.08960317748487157, 1.793304156180366e-16>
DoubleData <0.09347678115858926, 2.018823445176747e-16>
DoubleData <0.09734757348722312, 5.576265782080219e-16>
DoubleData <0.101215441667466, 6.856928051415318e-16>
DoubleData <0.1050802734163288, 7.80192436556187e-16>
DoubleData <0.1089419569898658, 4.846007563068433e-17>
DoubleData <0.1128003812016587, 6.957566319383172e-16>
DoubleData <0.1166554354410687, 6.438661649715736e-16>
DoubleData <0.1205070096912237, 8.182569515647907e-16>
DoubleData <0.1243549945467608, 6.352529150170111e-16>
DoubleData <0.1320397616146387, 4.274189722415823e-17>
DoubleData <0.1397088742891635, 1.913310428846708e-16>
DoubleData <0.1473614810886508, 8.658324432321626e-16>
DoubleData <0.1549967419239406, 3.426523229816613e-16>
DoubleData <0.1626138285979479, 6.461627098025713e-16>
DoubleData <0.1702119252854741, 2.74014592076487e-16>
DoubleData <0.1777902289926754, 6.898598082898684e-16>
DoubleData <0.1853479499956947, 5.969184350010091e-17>
DoubleData <0.1928843122579744, 2.42385590364413e-16>
DoubleData <0.2003985538258783, 1.696734079809579e-16>
DoubleData <0.2078899272022623, 7.012225510572437e-16>
DoubleData <0.215357699697738, 5.59849672442657e-17>
DoubleData <0.2228011537593941, 3.830792364463579e-16>
DoubleData <0.2302195872768431, 6.229360680729788e-16>
DoubleData <0.2376123138654709, 3.71404797316887e-16>
DoubleData <0.2449786631268633, 8.156104484719729e-16>
DoubleData <0.2596296294082574, 1.30261057387131e-16>
DoubleData <0.2741674511196583, 4.523505634252264e-16>
DoubleData <0.2885873618940771, 3.187832078137744e-16>
DoubleData <0.3028848683749708, 6.551229868720925e-16>
DoubleData <0.3170557532091465, 5.361722230696518e-16>
DoubleData <0.3310960767041315, 5.471589019367845e-16>
DoubleData <0.345002177207105, 8.808349770693735e-17>
DoubleData <0.3587706702705722, 3.088733564861919e-17>
DoubleData <0.3723984466767538, 4.081903701236505e-16>
DoubleData <0.3858826693980735, 3.013439834812086e-16>
DoubleData <0.3992207695752521, 4.665551909062331e-16>
DoubleData <0.4124104415973866, 7.057684437286449e-16>
DoubleData <0.4254496373700416, 6.894493455169868e-16>
DoubleData <0.4383365598579569, 8.632356493938598e-16>
DoubleData <0.4510696559885234, 3.280735600183736e-17>
DoubleData <0.4636476090008053, 8.553660459218291e-16>
DoubleData <0.4883339510564051, 4.32715973660733e-16>
DoubleData <0.5123894603107368, 8.627156382272694e-16>
DoubleData <0.5358112379604636, 1.069585067790331e-16>
DoubleData <0.558599315343562, 4.38633579301471e-16>
DoubleData <0.5807563535676703, 9.660765868058498e-17>
DoubleData <0.6022873461349638, 3.62571214759831e-16>
DoubleData <0.6231993299340655, 4.708132487014636e-16>
DoubleData <0.6435011087932843, 1.268570875139599e-16>
DoubleData <0.6632029927060925, 7.463955685933131e-16>
DoubleData <0.6823165548747481, 6.943223671560008e-18>
DoubleData <0.7008544078844494, 7.572798548942514e-16>
DoubleData <0.7188299996216241, 4.226108214056057e-16>
DoubleData <0.7362574289814274, 7.008731912580885e-16>
DoubleData <0.7531512809621939, 5.308545776533963e-16>
DoubleData <0.7695264804056574, 8.51128500644098e-16>
DoubleData <0.7853981633974483, 3.061616997868383e-17>
DoubleData <0.8156919233162228, 6.554192491473065e-16>
DoubleData <0.8441539861131702, 8.397650495807761e-16>
DoubleData <0.8709034570756522, 7.544578813301367e-16>
DoubleData <0.8960553845713433, 6.95372577632837e-16>
DoubleData <0.9197196053504166, 1.814702107965036e-16>
DoubleData <0.9420000403794635, 1.656306773209825e-16>
DoubleData <0.9629943306809361, 1.070356418673049e-16>
DoubleData <0.9827937232473287, 3.46970218418778e-16>
DoubleData <1.001483135694234, 7.605168950105478e-16>
DoubleData <1.019141344266349, 6.761378336444607e-16>
DoubleData <1.0358412530088, 3.194313981784504e-17>
DoubleData <1.051650212548373, 5.696281674604188e-16>
DoubleData <1.066630365315743, 6.065679184034902e-16>
DoubleData <1.080839000541168, 2.063682824136722e-16>
DoubleData <1.09432890732119, 2.165539287700089e-16>
DoubleData <1.10714871779409, 9.404471373566379e-17>
DoubleData <1.13095374397916, 5.153275478954471e-16>
DoubleData <1.152571997215667, 1.304472198360275e-16>
DoubleData <1.172273881128476, 7.499857009153807e-16>
DoubleData <1.190289949682532, 7.683333629842069e-17>
DoubleData <1.206817370285252, 2.637692813136457e-16>
DoubleData <1.222025323210989, 6.363421861261654e-16>
DoubleData <1.236059489478081, 7.449313421696882e-16>
DoubleData <1.249045772398254, 8.859822159005129e-16>
DoubleData <1.26109338225244, 2.544660011403809e-16>
DoubleData <1.272297395208717, 2.245875015034507e-17>
DoubleData <1.28274087974427, 8.788952309458591e-16>
DoubleData <1.292496667789785, 1.537365572357647e-16>
DoubleData <1.301628834009196, 2.09675419926785e-16>
DoubleData <1.310193935047555, 7.535879521228967e-16>
DoubleData <1.318242051016837, 3.808952695386159e-16>
DoubleData <1.325817663668032, 1.3380031118552e-16>
DoubleData <1.339705659598999, 6.401436961720526e-16>
DoubleData <1.352127380920955, 2.147674250751151e-17>
DoubleData <1.363300100359694, 3.313692220777249e-16>
DoubleData <1.373400766945015, 6.330567112173988e-16>
DoubleData <1.382574821490126, 1.86429700538549e-16>
DoubleData <1.390942827002418, 7.897412983652368e-16>
DoubleData <1.398605512271957, 4.208485980241463e-16>
DoubleData <1.40564764938027, 1.328183035426864e-16>
DoubleData <1.412141064608495, 5.703957436695217e-16>
DoubleData <1.418146998399631, 8.055395818750151e-16>
DoubleData <1.423717971406494, 3.09263314147271e-16>
DoubleData <1.428899272190733, 1.574732574926438e-16>
DoubleData <1.433730152484708, 8.442163750324489e-16>
DoubleData <1.438244794498222, 6.412036156724483e-16>
DoubleData <1.442473099109101, 7.776664761570937e-16>
DoubleData <1.446441332248135, 3.141578446404818e-16>
DoubleData <1.453687582228032, 8.199907271986118e-16>
DoubleData <1.460139105621001, 2.846543520033397e-16>
DoubleData <1.465919388064663, 1.24932049384283e-16>
DoubleData <1.471127674303734, 7.79686192079511e-16>
DoubleData <1.47584462045214, 4.779648065777258e-16>
DoubleData <1.480136439594151, 3.07070859685828e-16>
DoubleData <1.484057988118911, 4.096346991714267e-16>
DoubleData <1.487655094906455, 7.844371739461077e-17>
DoubleData <1.490966341082659, 6.221434760020122e-17>
DoubleData <1.494024435525118, 8.134142446723606e-16>
DoubleData <1.496857289136956, 6.830938909876451e-16>
DoubleData <1.499488862009605, 8.873091857396741e-16>
DoubleData <1.501939837493851, 7.267528106158521e-16>
DoubleData <1.504228163019072, 4.532269913842895e-16>
DoubleData <1.506369487369343, 5.605999046418969e-16>
DoubleData <1.508377516798939, 2.154370814741562e-16>
DoubleData <1.512040504079174, 1.402618800554007e-16>
DoubleData <1.515297821549179, 5.368157936821226e-16>
DoubleData <1.518213265183954, 7.375391359310956e-16>
DoubleData <1.520837931072953, 6.825613612540716e-16>
DoubleData <1.523213223517913, 4.501543596256141e-16>
DoubleData <1.52537304737332, 2.482983385700394e-17>
DoubleData <1.527345431403365, 7.934779986221159e-16>
DoubleData <1.529153747696308, 6.760940733854243e-16>
DoubleData <1.530817639671606, 3.549557335150754e-16>
DoubleData <1.532353736773708, 2.956322283362237e-16>
DoubleData <1.533776210920966, 9.377354806572844e-17>
DoubleData <1.535097214115572, 7.767503702335957e-16>
DoubleData <1.536327225795388, 8.864446875291863e-16>
DoubleData <1.537475330916649, 8.116858600761242e-17>
DoubleData <1.538549444359642, 5.614707476320801e-16>
DoubleData <1.539556493364628, 8.22229665146797e-16>
DoubleData <1.570796326794897, 6.123233995736766e-17>
The source is embedded in the dll but i can post it here later if the tests are ok. It shouldn´t be hard to port to masm, if someone is interested to work further with the function.
As the original version it can only work with 1 calculated value at once. (Double). But, i presume it could be optimized to handle 4 Floats at once to gain even more speed.
Btw, to convert it to degree you can also use this:
[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]
[Float_Zero: R$ 0]
Proc SSE_Atan2Double:
Arguments @pY, @pX, @ConvDegree
Uses eax, ebx, ecx
mov ebx D@pY
mov ecx D@pX
movupd XMM0 X$ebx
movupd XMM1 X$ecx
;call 'ucrtbase.__libm_sse2_atan2' <---original version inside ucrtbase.dll
call Sse2_atan2 <---mine tests.
.If D@ConvDegree = &TRUE ; <--- Flag to convert to degree (True/False)
movupd XMM1 X$Float_AtanPiFactor
mulpd XMM0 XMM1
xorpd xmm2 xmm2
SSE_D_If xmm0 < xmm2 ; A simple macro to handle sse2 comparisons. It is simply: COMISD XMM0 XMM2 | JNB D0> So, it is comisd followed by the corresponding jmp to D0 (Our Else_IF macro)
addpd xmm0 X$Float360
SSE_D_Else_If xmm0 > X$Float360 ; same as above: JMP D1> | D0:COMISD XMM0 X$FLOAT360 | JNA D0>
subpd xmm0 X$Float360
SSE_D_End_If ; The ending of the jumps D0: D1:
.End_If
mov eax &TRUE
EndP
Example of usage:
[TestingAtan2Data1: R$ 0.5317094316614787480759158718400589803054643151426570508592559650]
[TestingAtan2Data2: R$ 1]
call SSE_Atan2Double TestingAtan2Data1, TestingAtan2Data2, &TRUE
It will return in ST0 = 28º
Quote from: guga on February 08, 2022, 11:10:30 PM
Can someone test the timmings for this please ?
It's a DLL, Gustavo... how are we supposed to test it?
The same way as you did for the FastMath. Didn´t you used a external dll to call the function ? The benchmark apps can handle external dlls, right ?
something like
importlib test.lib
invoke Sse2_atan2
But, I can try disassemble it and create a source compatible to masm, if needed.
Please, see if this helps (It´s the disassembled source)
Quote from: guga on February 09, 2022, 05:13:51 AM
The same way as you did for the FastMath. Didn´t you used a external dll to call the function ? The benchmark apps can handle external dlls, right ?
something like
importlib test.lib
invoke Sse2_atan2
It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
Dll "GugaAtan2"
Declare Sse2_atan2, 2
fails miserably saying "can't find module Sse2_atan2" :sad:
Working here! (using source code)
No timing yet
erased a minor detail :biggrin:
Quote from: jj2007 on February 09, 2022, 06:04:39 AM
Quote from: guga on February 09, 2022, 05:13:51 AM
The same way as you did for the FastMath. Didn´t you used a external dll to call the function ? The benchmark apps can handle external dlls, right ?
something like
importlib test.lib
invoke Sse2_atan2
It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
Dll "GugaAtan2"
Declare Sse2_atan2, 2
fails miserably saying "can't find module Sse2_atan2" :sad:
That´s weird. The exported function i named as "Sse2_atan2". Are you sure you are importing test.dll ? Because the name of the dll is not "GugaAtan2"
Shouldn´t it be ?
Dll "test"
Declare Sse2_atan2,
1The ordinal value of the function inside the dll is 1 and not 2
Quote from: HSE on February 09, 2022, 07:23:02 AM
Working here!
No timing yet
erased a minor detail :biggrin:
Tks HSE :thumbsup: :thumbsup: :thumbsup:
Quote from: jj2007 on February 09, 2022, 06:04:39 AM
Quote from: guga on February 09, 2022, 05:13:51 AM
The same way as you did for the FastMath. Didn´t you used a external dll to call the function ? The benchmark apps can handle external dlls, right ?
something like
importlib test.lib
invoke Sse2_atan2
It can, Guga, but I need a minimum of documentation. Something like atan2=Sse2_atan2(double X, double Y)?
For now,
Dll "GugaAtan2"
Declare Sse2_atan2, 2
fails miserably saying "can't find module Sse2_atan2" :sad:
Sorry, i forgot to answer this
"Sse2_atan2(double X, double Y)?"
Yes, the contents of xmm1 and xmm0 are doubles.(The same as in the original library - ucrtbase.dll). The function does not use any parameter, but you must feed xmm0 and xmm1 registers with double values.
Internally i rewrote the function as this:
; The table was generated as CreateTangetTable
; I rewrote/reorganized the whole function to avoid heavy usage of spaghetti code as in the original version inside ucrtbase.dll
Proc Sse2_atan2::
movlpd X$SSEAtanTmpVal1 xmm0 ; <-----Temporary global variable to store the data to be used in the internal functions
movlpd X$SSEAtanTmpVal2 xmm1 ; <-----Temporary global variable to store the data to be used in the internal functions
pextrw eax xmm0 3 | and eax 07FF0 | sub eax 03870
pextrw ecx xmm1 3 | and ecx 07FF0 | sub ecx 03870
.If_And eax <= 0F00, ecx <= 0F00
call Sse2_atan2Internal4
.Else
pextrw ecx xmm0 3 | and ecx 07FF0
pextrw eax xmm1 3 | and eax 07FF0
If ecx = 07FF0
call Sse2_atan2Internal1
Else_If eax = 07FF0
call Sse2_atan2Internal2
Else
call Sse2_atan2Internal5
End_If
.End_If
EndP
Quote from: guga on February 09, 2022, 07:30:32 AMDll "test"
Declare Sse2_atan2, 1
The ordinal value of the function inside the dll is 1 and not 2
I changed the name of the DLL to avoid ambiguities. Declare (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1017) expects the number of args, but I found out debugging your DLL that you pass the two arguments in xmm0 and xmm1 (which is pretty unorthodox).
So, here are results for Sse2_atan2(1.0, 2.0) - as compared to ucrtbase atan2:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6472 cycles for 100 * Sse2_atan2
17801 cycles for 100 * ucrtbase atan2
6443 cycles for 100 * Sse2_atan2
17799 cycles for 100 * ucrtbase atan2
6424 cycles for 100 * Sse2_atan2
17810 cycles for 100 * ucrtbase atan2
6423 cycles for 100 * Sse2_atan2
17841 cycles for 100 * ucrtbase atan2
123 bytes for Sse2_atan2
127 bytes for ucrtbase atan2
Real8 -56.30993247402027180 Sse2_atan2
Real8 -56.30993247402021495 ucrtbase atan2
The two calls are as follows:
Dll "GugaAtan2"
Declare Sse2_atan2, 0 ; 0 arguments because they are passed in xmm0 and xmm1
Dll "ucrtbase"
Declare atan2, C:2 ; 2 arguments, C calling convention
...
movlps xmm0, FP8(-33.0)
movlps xmm1, FP8(22.0)
void Sse2_atan2()
void atan2(FP8(-33.0), FP8(22.0))
See TestA and TestB for differences in the way the two algos return their results.
Hi JJ. Many tks. It seems pretty fast :azn: :azn: :azn:
I know it is an unorthodox way, but i just wanted to test the functionality on the exact way as it is in the original dll (ucrtbase). To use regular parameters, i created another function to test like this:
[Float_AtanPiFactor: R$ (180/3.1415926535897932384626433832795)]
[Float360: R$ 360]
[Float_Zero: R$ 0]
pY(Input) = Pointer to a Double (Real8) value
pX(input) = Pointer to a Double (Real8) value
ConvDegree - A Flag to convert the output to degree or radian. True = convert to degree. Radian, otherwise
Proc SSE_AtanEx:
Arguments @pY, @pX, @ConvDegree
Uses eax, ebx, ecx, edx
mov ebx D@pY
mov ecx D@pX
movupd XMM0 X$ebx
movupd XMM1 X$ecx
call Sse2_atan2 ;<------ call the main function
.If D@ConvDegree = &TRUE
movupd XMM1 X$Float_AtanPiFactor
mulpd XMM0 XMM1
movupd XMM1 X$Float360
xorpd xmm2 xmm2
SSE_D_If xmm0 < xmm2 ; Simple SSE comparition macro as explained earlier. COMISD XMM0 XMM2 | jnb D0> ......
addpd xmm0 xmm1
SSE_D_Else_If xmm0 > xmm1 ; Same as above
subpd xmm0 xmm1
SSE_D_End_If ; endpoint of the comparition macros. Those macros works the same as the regular If/Else/Else_If, but focused to SSE registers.
.End_If
mov eax &TRUE
EndP
Btw..later both functions can be joined to allow the usage of parameters on a single function, but i wanted to test the speed and accuracy of this routine 1st.
Quote from: jj2007 on February 09, 2022, 08:10:55 AM
Quote from: guga on February 09, 2022, 07:30:32 AMDll "test"
Declare Sse2_atan2, 1
The ordinal value of the function inside the dll is 1 and not 2
I changed the name of the DLL to avoid ambiguities. Declare (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1017) expects the number of args, but I found out debugging your DLL that you pass the two arguments in xmm0 and xmm1 (which is pretty unorthodox).
So, here are results for Sse2_atan2(1.0, 2.0) - as compared to ucrtbase atan2:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6405 cycles for 100 * Sse2_atan2
35327 cycles for 100 * ucrtbase atan2
6407 cycles for 100 * Sse2_atan2
35314 cycles for 100 * ucrtbase atan2
6420 cycles for 100 * Sse2_atan2
35373 cycles for 100 * ucrtbase atan2
6426 cycles for 100 * Sse2_atan2
35292 cycles for 100 * ucrtbase atan2
105 bytes for Sse2_atan2
108 bytes for ucrtbase atan2
Real8 0.4636476090008099793 Sse2_atan2
Real8 1.570796326794896558 ucrtbase atan2
Results are different, so I probably got some arguments wrong :sad:
About the results being different, maybe it is reusing xmm0 and xmm1 when the functions are called.
Here it works ok if i only do this:
[TestingAtan2Data1: R$ 1.0]
[TestingAtan2Data2: R$ 2.0]
movupd XMM0 X$TestingAtan2Data1
movupd XMM1 X$TestingAtan2Data2
call 'ucrtbase.__libm_sse2_atan2'
movupd XMM0 X$TestingAtan2Data1
movupd XMM1 X$TestingAtan2Data2
call Sse2_atan2
xmm0 = 4.63647609000806093e-1 (ucrtbase)
xmm0 = 4.63647609000809979e-1 (mine version).
The small difference is due to the rounding error that i removed while i was testing the functionality. If you use the complete tangent table, then it should generate the same result. Since, i´m still testing it, i´m unsure how this whole thing works exactly.
I made some tests and placed the results of the tangent table on excel and it seems to be a sort of rotating table that the function is using. I´m trying to find if there is a direct relation between the table values and the resultant data, to see if the speed can be improved further, simply making pointers to the table and some small calculations. But, i still have no idea exactly how this function really works.
Impressive :thumbsup:
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
6783 cycles for 100 * Sse2_atan2
11995 cycles for 100 * fpatan
6783 cycles for 100 * Sse2_atan2
11999 cycles for 100 * fpatan
6798 cycles for 100 * Sse2_atan2
12005 cycles for 100 * fpatan
6786 cycles for 100 * Sse2_atan2
11989 cycles for 100 * fpatan
29 bytes for Sse2_atan2
20 bytes for fpatan
Real8 0.4636476090008099793 Sse2_atan2
Real8 0.4636476090008060935 fpatan
This include result storage.
Tks HSE :azn: :azn: :azn:
Sorry for the syntax. I´m used to RosAsm that´s why i have some difficulties in port the code, but it seems to work as expected with minor differences then the original version. You can rewrite the code as you wish.I believe there´s a way to make it even faster, but it is necessary to understand what exactly the whole function is doing. It uses a table of tangents created on a certain way, displaced on 164 values, with a separation of 16 "indexes". I don´t know why it is using only 164 indexes rather then the 176 = 11 indexes *16 (closer to 180º). In excel there seems to be a sort of relation between the indexes and the resultants value, but i´m not sure yet.
Once the whole function is fully understand then, perhaps optimization could be done, forcing the function to stay in only 1 quadrant (90º) and identifying the changes of sign in other quadrants and simply add or subtract 360º from the result. Don´t know yet.
There are other functions inside ucrtbase that seems really good, but the problems is that the way they were written kills speed due to heavy spaghetti code. On the original libm_sse2_atan2, it contains jumps that go back and forward the entire function, and also some pointers to other frames (other functions) as well. What i did was reorganize the whole code and create the proper sub-functions avoiding the heavy usage of spaghetti code inside.
Hi Guga,
It seems you had an earlier version (I edited the post). Here is the latest, with HSE's fpatan addition:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6923 cycles for 100 * Sse2_atan2
18360 cycles for 100 * ucrtbase atan2
11834 cycles for 100 * fpatan
6434 cycles for 100 * Sse2_atan2
18435 cycles for 100 * ucrtbase atan2
11476 cycles for 100 * fpatan
6430 cycles for 100 * Sse2_atan2
17775 cycles for 100 * ucrtbase atan2
11868 cycles for 100 * fpatan
6429 cycles for 100 * Sse2_atan2
17821 cycles for 100 * ucrtbase atan2
11493 cycles for 100 * fpatan
123 bytes for Sse2_atan2
127 bytes for ucrtbase atan2
110 bytes for fpatan
Real8 -56.30993247402027180 Sse2_atan2
Real8 -56.30993247402021495 ucrtbase atan2
Real8 -56.30993247402021495 fpatan
Tks a lot, JJ
I´ll take a look. There is a small difference in the results, but it´s due to the rounding error that i removed earlier.
I´m trying to understand the math behind this, and why the function is using things like:
pextrw eax xmm0 3 | and eax 07FF0 | sub eax 03870
...
mov eax 0800 | pinsrw xmm5 eax 3
etc etc.
before it reaches what is really necessary to compute the atan2, i mean, before it do things like divsd xmm4 xmm1 etc.
If i could understand what´s going on before it reaches the SSe division, then maybe we could gain more speed, simply making the division of xmm0 / xmm1 at 1st and then do minor calculations before it reaches the tangent table to later adjust the remainder value.
For me, it´s a bit weird it´s using an error value in the 2nd Real8 on the tangent table, because the values of the 1st and 2nd Real8 are complementary, so, it seems useless calculating all of this feeding error values. What´s also weird is that the original version from ucrtbase and fpatan produces some smaller differences then the values found in wolfram-alpha, for example.
Btw...follow the results i´ve got in your last version
:dazzled: :dazzled: :dazzled: :dazzled:
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
6931 cycles for 100 * Sse2_atan2
69442 cycles for 100 * ucrtbase atan2
14871 cycles for 100 * fpatan
7411 cycles for 100 * Sse2_atan2
69354 cycles for 100 * ucrtbase atan2
14886 cycles for 100 * fpatan
6994 cycles for 100 * Sse2_atan2
68109 cycles for 100 * ucrtbase atan2
14957 cycles for 100 * fpatan
7040 cycles for 100 * Sse2_atan2
67606 cycles for 100 * ucrtbase atan2
14731 cycles for 100 * fpatan
123 bytes for Sse2_atan2
127 bytes for ucrtbase atan2
110 bytes for fpatan
Real8 -56.30993247402027180 Sse2_atan2
Real8 -56.30993247402021495 ucrtbase atan2
Real8 -56.30993247402021495 fpatan
--- ok ---
Now i can try to make a small test on the routine i was doing for marinu´s Fir Algo and FFT Analyser to retrieve the value of phase of FFT Frequencies :)
Quote from: guga on February 09, 2022, 09:42:00 AM
then maybe we could gain more speed
The reason that SSE algorithm is fast than FPU is because have less precision. While FPU use 8-9 degree polinomies, SSE algorithms are using 5-6 degrees (if I remeber well some discussion somewhere in this forum).
If you can accept less precision perhaps You can use a faster algorithm.
Then perhaps You can build a hyperfaster low precision library for previews, and a best precision library for final works.
I don´t know if the precision is less accurate. On the original version It seems to produce the same result as in FPU. I finished some minor optimizations both for speed and accuracy. Now it seems to result the same value as the original one, and perhaps a bit more fast.
The function (and the original) uses an approximation that seems to be biased in:
https://www.dsprelated.com/showarticle/1052.php
But, it works better, since it is more accurate then the results of the link. It is using a rotating table of tangents. On my last version i removed several useless SSE checks, including the ones at the start of the function, replaced them with the regular opcodes counterpart (mov eax etc) and forced the routines to simply point to the proper masks. I´m currently trying to understand how exactly one of the internal function works, "Sse2_atan2Internal4", because this is where the table of tangents are being used. It gets a sort of index to calculate the value, and maybe we can create a similar binary search function as JJ did sometime ago to point to the indexes of the table or something like that.
I´m currently labeling the variables and forcing the function to use the proper parameters rather then xmm0/xmm1 directly and also creating another function to retrieve the resul in Angles too. Once i finish i´ll upload it here to we perform the benchmarks. :smiley: :smiley:
Quote from: guga on February 10, 2022, 02:09:47 AM...table of tangents are being used. It gets a sort of index to calculate the value, and maybe we can create a similar binary search function as JJ did sometime ago to point to the indexes of the table or something like that.
Quote from: jj2007 on February 07, 2022, 09:54:42 AMIf your problem is to find atan2(X,Y), and either X or Y is a constant, then FastMath (http://masm32.com/board/index.php?topic=8779.0) would be an option.
Tks, JJ.
I couldn´t understand how to test FastMath. Also, isn´t it using the windows dll (crt or ucrtbase) underneath ?
Btw...here goes a new version. I made some modifications as you suggested and make it work with regular parameters. One thing to consider when assembling this. The global variables Atan2_Mask1 to Atan2_Mask12 must be aligned to a 16 byte-boundary, otherwise the app will crash. This is because they are being used as variables to things like minsd, andpd etc
I added 4 functions (Renamed the previous function . "D" for double for easier comprehension):
SSE_Atan2DEx: Retrieves the atan2 of a value and also can convert the result to Degree or Radian.
Uses 3 parameters:
pY - Pointer to a Real8 value
pX - Pointer to a Real8 value
ConvDegree- A Flag (integer) to convert the result to degrees. TRUE = Degrees. FALSE = Radians.
Return Value: The resultant value is located in xmm0 register
SSE_Atan2D: Retrieves the atan2 of a value.
Uses 2 parameters:
pY - Pointer to a Real8 value
pX - Pointer to a Real8 value
Return Value: The resultant value is located in xmm0 register (In radians)
Atan2: Retrieves the atan2 of a value and also can convert the result to Degree or Radian. Same as SSE_Atan2DEx, but using FPU rather then SSE2
Uses 3 parameters:
pY - Pointer to a Real8 value
pX - Pointer to a Real8 value
ConvDegree- A Flag (integer) to convert the result to degrees. TRUE = Degrees. FALSE = Radians.
Return Value: The resultant value is located in ST0 register
CreateTangetTable This was for my tests only. It generates a table of tangents used in SSE_Atan2DEx function. The function uses no parameters, and the resultant values are stored in MyTangTBl variable
Typo:
The SSE_Atan2D uses 2 parameters and not 3 as i wrote in the previous post. Post now edited :smiley:
Quote from: guga on February 10, 2022, 09:47:29 AM
I couldn´t understand how to test FastMath. Also, isn´t it using the windows dll (crt or ucrtbase) underneath ?
No, it doesn't use crt or similar.
Quote from: jj2007 on February 07, 2022, 09:54:42 AMIf your problem is to find atan2(X,Y), and either X or Y is a constant, then FastMath (http://masm32.com/board/index.php?topic=8779.0) would be an option.
But FastMath is limited to one argument functions, that's why I asked what your usage of that function is. Anyway, I attach a new version that includes FastMath with a fixed second argument:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6699 cycles for 100 * Sse2_atan2
18672 cycles for 100 * ucrtbase atan2
11874 cycles for 100 * fpatan
1568 cycles for 100 * FastAtan
6511 cycles for 100 * Sse2_atan2
18258 cycles for 100 * ucrtbase atan2
11638 cycles for 100 * fpatan
1426 cycles for 100 * FastAtan
6581 cycles for 100 * Sse2_atan2
18033 cycles for 100 * ucrtbase atan2
11706 cycles for 100 * fpatan
1477 cycles for 100 * FastAtan
6574 cycles for 100 * Sse2_atan2
18159 cycles for 100 * ucrtbase atan2
11986 cycles for 100 * fpatan
1450 cycles for 100 * FastAtan
123 bytes for Sse2_atan2
127 bytes for ucrtbase atan2
106 bytes for fpatan
269 bytes for FastAtan
Real8 -56.30993247402027180 Sse2_atan2
Real8 -56.30993247402021495 ucrtbase atan2
Real8 -56.30993247402021495 fpatan
Real8 -56.30993247402021495 FastAtan
Here is the FastMath usage, first the creation (once):
FastMath FastAtan ; define a math function
For_ fct=-40.0 To 10.0 Step 0.5 ; max 10,000 iterations
fld fct ; X
fstp REAL10 ptr [edi]
fld fct
fld FP8(22.0)
fpatan ; Masm32 algo
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
FastMath
And here the usage in the loop, see TestD:
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
void FastAtan(-33.0)
dec ebx
.if !Sign?
fstp st
.endif
.Until Sign?
Timings for the creation of the FastMath table (it could be done many times in an outer loop for the second argument):
39 µs for creating FastMath
28 µs for creating FastMath
23 µs for creating FastMath
26 µs for creating FastMath
23 µs for creating FastMath
25 µs for creating FastMath
23 µs for creating FastMath
25 µs for creating FastMath
23 µs for creating FastMath
23 µs for creating FastMath
WOW !
Tks JJ. But, i think you used the old version. The new one (That uses parameters) is the one attached previously (also attached here). The name of the function is SSE_Atan2DEx (Uses 3 parameters to convert to degree) or SSE_Atan2D (Using only 2 parameters. No degree conversion)
I tried to assemble your file with RichMasm, but got this msg
*** Start C:\masm32\MasmBasic\Res\bldallRM.bat ***
*** 32-bit assembly ***
*** Assemble, link and run GugaAtan2 ***
*** Assemble using ml ***
Assembling: tmp_file.asm
##########################
You CANNOT use the MasmBasic library with
the old ml.exe version 6.14 - use UAsm instead
##########################
tmp_file.asm(13) : error A2008: syntax error : StackWalk
*** MasmBasic version 18.02.2021 ***
*** Warning: SQWORD is unsigned with ml **
\masm32\MasmBasic\MasmBasic.inc(728) : error A2052: forced error
TestMasmVersion(8): Macro Called From
\masm32\MasmBasic\MasmBasic.inc(728): Include File
*** Assembly error ***
I do have uasm installed in the masm directory "C:\masm32\bin\uasm64.exe". How do i configure richmasm to assemble the file using uasm, so i can try to assemble the file as well ?
Quote from: guga on February 11, 2022, 02:00:17 AMI tried to assemble your file with RichMasm, but got this msg
Oops, my fault:
- comment out StackWalk (http://masm32.com/board/index.php?topic=9803.msg107496#msg107496) in line 13 (that's a new macro, not yet included in the current MasmBasic version)
- use OPxT_Assembler at the end of the file (the x disables the option; your ml.exe is the old 6.14)
New version below. Creating the FastMath table for a new second argument costs 12 microseconds on average. You still have not told us how you will use this; since you are interested in speed, I suppose you have an innermost plus some outer loop.
Can the outer loop work with a 12 microseconds delay?12 us on average for creating FastMath
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6417 cycles for 100 * SSE_Atan2D_new
18184 cycles for 100 * ucrtbase atan2
11387 cycles for 100 * fpatan
1438 cycles for 100 * FastAtan
6273 cycles for 100 * SSE_Atan2D_new
18093 cycles for 100 * ucrtbase atan2
11382 cycles for 100 * fpatan
1398 cycles for 100 * FastAtan
6171 cycles for 100 * SSE_Atan2D_new
18259 cycles for 100 * ucrtbase atan2
11396 cycles for 100 * fpatan
1398 cycles for 100 * FastAtan
6378 cycles for 100 * SSE_Atan2D_new
18089 cycles for 100 * ucrtbase atan2
11382 cycles for 100 * fpatan
1398 cycles for 100 * FastAtan
49 bytes for SSE_Atan2D_new
51 bytes for ucrtbase atan2
30 bytes for fpatan
193 bytes for FastAtan
Real8 -56.30993247402021495 SSE_Atan2D_new
Real8 -56.30993247402021495 ucrtbase atan2
Real8 -56.30993247402021495 fpatan
Real8 -56.30993247402021495 FastAtan
Hi JJ.
Many Tks. I´ll take a look
(Post edited), i didn´t saw the file you uploaded when i was writing this comment.
My results with the new version. A bit faster and accurated :azn: :azn:
31 us on average for creating FastMath
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
5123 cycles for 100 * SSE_Atan2D_new
67818 cycles for 100 * ucrtbase atan2
14832 cycles for 100 * fpatan
1479 cycles for 100 * FastAtan
5166 cycles for 100 * SSE_Atan2D_new
67255 cycles for 100 * ucrtbase atan2
14665 cycles for 100 * fpatan
1523 cycles for 100 * FastAtan
5029 cycles for 100 * SSE_Atan2D_new
68465 cycles for 100 * ucrtbase atan2
15228 cycles for 100 * fpatan
1766 cycles for 100 * FastAtan
5076 cycles for 100 * SSE_Atan2D_new
67520 cycles for 100 * ucrtbase atan2
14604 cycles for 100 * fpatan
1568 cycles for 100 * FastAtan
49 bytes for SSE_Atan2D_new
51 bytes for ucrtbase atan2
30 bytes for fpatan
193 bytes for FastAtan
Real8 -56.30993247402021495 SSE_Atan2D_new
Real8 -56.30993247402021495 ucrtbase atan2
Real8 -56.30993247402021495 fpatan
Real8 -56.30993247402021495 FastAtan
Btw, i´m impressed about the results of FastAtan. I analysed it in the disassembler and got a bit clueless how it is showing the correct result since it uses only 1 parameter (value: 33) and not the other one. The FastMath library is being applied before the computations, right ? Because at
.text:00401201 UsedInFastAtan: ; CODE XREF: MainRoutine+418↓j
.text:00401201 fld tbyte_408068 <- 0.22 = 22/100 (Our 1st parameter Y) ?
.text:00401207 fld tbyte_409992
.text:0040120D faddp st(1), st
.text:0040120F fstp tbyte_409992
.text:00401215 push ecx
.text:00401216 fxsave dword_409B50
.text:0040121D push 0 ; dwMilliseconds
.text:0040121F call Sleep
.text:00401224 push offset PerformanceCount ; lpPerformanceCount
.text:00401229 call QueryPerformanceCounter
.text:0040122E fxrstor dword_409B50
.text:00401235 pop ecx
.text:00401236 push ecx
.text:00401237 push edi
.text:00401238 push 0C3500h ; dwBytes
.text:0040123D push 4 ; dwFlags
.text:0040123F push hHeap ; hHeap
.text:00401245 call HeapAlloc
.text:0040124A mov edi, eax
.text:0040124C mov dword_4098E0, edi
.text:00401252 push edi
.text:00401253 add edi, 61A80h
.text:00401259 push edi
.text:0040125A ffree st(7)
.text:0040125C fld tbyte_40807C
.text:00401262 ffree st(7)
.text:00401264 fld st
.text:00401266 fstp tbyte_409988
.text:0040126C fld tbyte_408072
.text:00401272 fsubrp st(1), st
.text:00401274 fld tbyte_408086
.text:0040127A fdivp st(1), st
.text:0040127C mov edx, offset dword_4099A4
.text:00401281 fmul dbl_408090
.text:00401287 fisttp dword ptr [edx]
.text:00401289 fld tbyte_408086
.text:0040128F fimul dword ptr [edx]
.text:00401291 fld tbyte_409988
.text:00401297 faddp st(1), st
.text:00401299 fstp tbyte_4099FA
.text:0040129F cmp dword_4099A4, 0
.text:004012A6 js short loc_4012EA
.text:004012A8
.text:004012A8 loc_4012A8: ; CODE XREF: MainRoutine+2C4↓j
.text:004012A8 fld tbyte_409988
.text:004012AE fstp tbyte ptr [edi]
.text:004012B0 fld tbyte_409988
.text:004012B6 fld tbyte_409992
.text:004012BC fpatan <---------------Prwcalculated
if i understood correctly, all the computation also uses fpatan opcode, but it is precalculated before the calculations of the timming. So, for the resultant timming be more acccurated, wouldn´t be necessary to add to the final timming of "void FastAtan(-33.0)", the timming it took to precalculate the data. I.e: the fpatan timming as well ?
About the usage. A faster Atan2 is needed to compute more data related to a FFT from audio samples. I'm testing Marinus FirAnalyser and FFTAnalyser in order to create a app that can export ta audio filtered with Fir algorithm or other sort of algorithm that uses FFT too.
http://masm32.com/board/index.php?topic=4231.135
http://masm32.com/board/index.php?topic=3568.msg39209#msg39209
http://masm32.com/board/index.php?topic=3665.15
I´m currently trying to finish a routine to save/export an audio filtered with Marinu´s Fir Algorithm. Later i´ll try to do a similar spectrogram as he did using FFT routines, but instead only calculating the Real or Imaginary frequencies, i wanted to export everything related to an audio file/sample, such as amplitude, phase (that uses atan2 function), pitch identification, Sound Intensity, Sound Power level and finally estimate the distance of the sound and the microphone that recorded it and created the sample (estimating the distance of each frequency can be handy to isolate them or identify backgrounds etc etc).
So, input = audio sample
Output = A structure containing everything related to that sample (or sequence of samples).
Doing that, with all information related to a certain audio file, we can build better tools to identify pitch, noise, isolating vocals etc etc using techniques similar to facebook demucs, or facebook wave2text etc etc. Or also creating a tool that can act as a photoshop, but for audio. example, once we know the frequencies the generated data is placed on a Spectrogram (such as the one existent in Isotope RX using Mel Scale), then we can simply work on the audio file on the same way we do for images. We can identify vocals using only the generated image rather then the heavy computations that involves audio etc
His FFT analyzer has a nice spectrogram but i wanted something like this
(https://imgkub.com/images/2022/02/11/Untitled.th.jpg) (https://imgkub.com/image/Jz0Pp)
Isotope is an extraordinary audio editing tool, but lack some features, such as phase identification, and sound separation (such as facebook´s one - demucs) etc etc
Quote from: guga on February 11, 2022, 04:41:31 AMif i understood correctly, all the computation also uses fpatan opcode, but it is precalculated before the calculations of the timming. So, for the resultant timming be more acccurated, wouldn´t be necessary to add to the final timming of "void FastAtan(-33.0)", the timming it took to precalculate the data. I.e: the fpatan timming as well ?
"31 µs on average for
creating FastMath"
That's why I asked (repeatedly) what your requirements are. If there are two nested loops, with X as the first and Y as the second counter, then the outer loop could re-create the FastMath table (at 12µs on my, 30µs on yours), while the inner loop just uses it.
Hi Guys
I succeeded to find the mathematical equation behind the Atan2 function. It uses a variation of a Taylor Series on the 4th degree, and seems to use some fractions derived from a polynomial calculation (Chebyshev ?)
The polynomial used is this:
0.9980475428149416 * (Y^9)/9 - 0.9999987497055115 * (Y^7)/7 + 0.9999999997101690 * (Y^5)/5 - 0.9999999999999822 * (Y^3)/3 + Y
Where Y = A value resultant from the division of xmm0/xmm4
The divisor is calculated as:
-k + (1 + k^2)/(k + z)
where
k = X/Y = Inputed XValue / Inputed YValue
z = The position on the table that decrements each 16 values. This "position" is, in fact a arctangent of a value starting from 32 to 0.029296875000 (From bottom to top on the table)
I´m making some tests using JJ´s technique he did sometime ago on the binary search algo.
I´ll later try to make further tests before implement a newer routine to check for speed. So, far, i succeeded to remove several repeated usage of xmm registers. I´ll try to limit the number of registers, once i suceed to decode this Taylor series and see if it will work on all situations.
Hi Guys
I succeeded to simplify the math , but i have no idea, why the original version is faster then the one i´m testing. Since it is using less registers and i removed some opcodes, it should be faster. But, for some reason, the original one, is achieving around 5000 clocks only, while the other version i´m testing has only 6500 clocks, even using less registers. Both results are ok, but, what i´m doing wrong ? Why this new version is slower even using less registers and less opcodes ? The only difference in the new version is on a routine, i called Sse2_atan2_ComputePolynomial. Is there a cache issue in SSE2 registers that i´m, not aware ?
Btw: I also tested with JJ´s Binary search routine, but it was unsuccessful. The speed results is the same as he version that does not use the loop.
The original version of thee routine is this:
Proc Sse2_atan2_ComputePolynomial:
Arguments @pY, @pX
Local @StartBlockIndexX, @StartBlockIndexY, @MaskedIndex
Uses ecx
mov eax D@pY | movupd xmm0 X$eax; TestingAtan2Data1
mov eax D@pX | movupd xmm1 X$eax; TestingAtan2Data2
unpcklpd xmm0 xmm1 ; xmm1 = SSEAtanTmpVal2
movmskpd eax xmm0 | add eax eax | mov D@MaskedIndex eax
xorpd xmm3 xmm3
movupd xmm5 X$Atan2_DivPower895 ; 1/2^895 same as: xorpd xmm5 xmm5 | mov eax 0800 | pinsrw xmm5 eax 3
paddw xmm5 xmm1
psrlq xmm5 29
rcpss xmm3 xmm5
psllq xmm3 29
paddw xmm3 X$Atan2_DivPower127 ; xmm3+(1/(2^127)) Warning. The data at this variable must be aligned to 16 bytes
mulsd xmm3 xmm0 ; Multiply this result with the unpacked data from SSEAtanTmpVal2
paddd xmm3 X$Atan2_Mask4
andpd xmm3 X$Atan2_Mask3 ; get the value of our table = tan(ValueUnknown) = 1.125 etc
movsd xmm5 xmm3 ; xmm5 used to get the index for out tangent table
mov eax 08000 | pextrw eax xmm3 3 | and eax 07FF0 | sub eax 03F9E | mov D@StartBlockIndexX eax | add eax 942 | mov D@StartBlockIndexY eax
minsd xmm3 X$Atan2_Float_32
SSE_ABS_REAL8 xmm0 ; Convert to positive. This macro is the same as below
;psllq xmm0 1
;psrlq xmm0 1 ; sign bit of pY was removed (IF any)
cmplesd xmm5 X$Atan2_Float_32
SSE_ABS_REAL8 xmm1 ; Convert to positive. This macro is the same as below
; psllq xmm1 1 ; sign bit of pX was removed (IF any)
; psrlq xmm1 1 ; contains the value of pX TESTINGATAN2DATA2
movsd xmm6 xmm1 ; xmm6 = pX
movsd xmm7 xmm1 ; xmm7 = pX
movsd xmm4 xmm0 | mulsd xmm4 xmm3 | andpd xmm1 xmm5 | addsd xmm4 xmm1
mov ecx 0 | pinsrw xmm6 ecx 0 | subsd xmm7 xmm6
mulsd xmm6 xmm3 | mulsd xmm7 xmm3 | andpd xmm0 xmm5 | subsd xmm0 xmm6
subsd xmm0 xmm7
; TangentTbl2
.If D@StartBlockIndexX <= 1121 ; used xmm4, xmm0, xmm5, xmm3
; Get tangent index and put in ecx
mov eax D@StartBlockIndexX | pextrw eax xmm5 0 | not eax | and eax 1
mov ecx D@StartBlockIndexY | pextrw ecx xmm3 3 | sub ecx 03F9E | add ecx eax | add ecx ecx
movupd xmm5 X$TangentTbl+ecx*8
mov eax D@MaskedIndex
movupd xmm6 X$Atan2_P_TBL+eax*8
movupd xmm1 X$Atan2_SGN_TBL+eax*8
xorpd xmm5 xmm1 | addpd xmm5 xmm6
movsd xmm6 xmm5
unpckhpd xmm5 xmm5
divsd xmm0 xmm4 ; we are dividing pX/pY if abs(px) < abs(py)
; And do the polynomial routine. A Taylor series of 4th degree of the tangent math
xorpd xmm1 xmm0
movsd xmm4 xmm1
mulsd xmm0 xmm0 ; Y = Y^2
movlpd xmm2 X$Atan2_FactorA9 ; FactorA9
mulsd xmm2 xmm0 ; A9*Y = FactorA9*Y
movlpd xmm3 X$Atan2_Limit3
addsd xmm3 xmm0 ; Atan2_Limit3+Y
addsd xmm1 xmm6
subsd xmm6 xmm1
addsd xmm6 xmm4
addsd xmm2 X$Atan2_Limit2 ; FactorA9*Y+Atan2_Limit2
mulsd xmm3 xmm0 ; (Atan2_Limit3+Y)*Y
mulsd xmm4 xmm0
addsd xmm6 xmm5
mulsd xmm2 xmm4
addsd xmm3 X$Atan2_Limit4 ; Atan2_Limit4 + (Atan2_Limit3+Y)*Y
mulsd xmm2 xmm3
addsd xmm2 xmm6
addsd xmm1 xmm2 ; xmm1 = -HalfPi, HalfPi or 0
movupd xmm0 xmm1
.Else_If D@StartBlockIndexY <= 942;03AE ; Simple_Polynomial used xmm1, xmm3, xmm6, xmm0
mov eax D@pY | movupd xmm2 X$eax | SSE_ABS_REAL8 xmm2
movupd xmm4 X$Atan2_Float_One ; same as: xorpd xmm4 xmm4 | mov ecx 03FF0 | pinsrw xmm4 ecx 3
divsd xmm4 xmm1
mov eax D@MaskedIndex
movupd xmm6 X$Atan2_SGN_TBL+eax*8
unpcklpd xmm3 xmm3
xorpd xmm0 xmm6
xorpd xmm2 xmm6
xorpd xmm3 xmm6
movupd xmm7 X$Atan2_P_TBL2+eax*8
movlpd xmm1 X$Atan2_FactorA9
movlpd xmm5 X$Atan2_Limit3
andpd xmm3 X$Atan2_SELECT_B+eax*8
mulsd xmm2 xmm4
mulsd xmm0 xmm4
movsd xmm6 xmm2
mulsd xmm2 xmm2
mulsd xmm1 xmm2
addsd xmm5 xmm2
mulsd xmm6 xmm2
addsd xmm1 X$Atan2_Limit2
mulsd xmm5 xmm2
addsd xmm7 xmm0
addpd xmm7 xmm3
mulsd xmm1 xmm6
addsd xmm5 X$Atan2_Limit4
mulsd xmm5 xmm1
addsd xmm5 xmm7
unpckhpd xmm7 xmm7
addsd xmm5 xmm7
movupd xmm0 xmm5
.Else ; Big_or_Small
call Sse2_atan2_Big_or_Small
.End_If
EndP
Newer version is:
Proc Sse2_atan2_ComputePolynomial:
Arguments @pY, @pX
Local @StartBlockIndexX, @StartBlockIndexY, @PosInTable
Uses esi, ecx
movupd xmm1 X$Atan2_DivPower895 ; 1/2^895 same as: xorpd xmm5 xmm5 | mov eax 0800 | pinsrw xmm5 eax 3
mov eax D@pX | movupd xmm0 X$eax
paddw xmm1 xmm0
psrlq xmm1 29
xorpd xmm2 xmm2
rcpss xmm2 xmm1
psllq xmm2 29
paddw xmm2 X$Atan2_DivPower127 ; xmm3+(1/(2^127)) Warning. The data at this variable must be aligned to 16 bytes
mov eax D@pY | movupd xmm0 X$eax
mulsd xmm2 xmm0 ; Multiply this result with the unpacked data from SSEAtanTmpVal2
paddd xmm2 X$Atan2_Mask4
andpd xmm2 X$Atan2_Mask3 ; get the value of our table = tan(ValueUnknown) = 1.125 etc
movsd xmm1 xmm2 ; xmm5 used to get the index for out tangent table
pextrw eax xmm2 3 | and eax 07FF0 | sub eax 16286 | mov D@StartBlockIndexX eax | add eax 942 | mov D@StartBlockIndexY eax
minsd xmm2 X$Atan2_Float_32
cmplesd xmm1 X$Atan2_Float_32
.If D@StartBlockIndexX <= 1121
; Get tangent index and put in ecx
pextrw eax xmm1 0 | not eax | and eax 1
pextrw ecx xmm2 3 | sub ecx 16286 | add ecx eax | add ecx ecx | mov D@PosInTable ecx
; we already have Y in xmm0, so, let´s put X on xmm1
mov eax D@pX | movupd xmm1 X$eax
; get our masked index
unpcklpd xmm0 xmm1 | movmskpd ecx xmm0 | add ecx ecx; | mov D@MaskedIndex eax
SSE_ABS_REAL8 xmm0 ; Turn Y in Positive
SSE_ABS_REAL8 xmm1 ; Turn X in Positive
; xmm2 already have our tangent pos index
pshufd xmm2 xmm2 {SHUFFLE 1,0,1,0} ; we are simply coying from LowQuadrord to Hi Quaddword
mulsd xmm2 xmm0 ; Y*z
addsd xmm2 xmm1 ; X+Y*z
pshufd xmm2 xmm2 SSE_SWAP_QWORDS ; swap (X + Y z)in xmm2, to we reuse it
mulsd xmm1 xmm2; X*z
subsd xmm0 xmm1; Y - X z
movhlps xmm1 xmm2 ; copy (X + Y z) in the hiqword of xmm2 to low Qword in xmm1
divsd xmm0 xmm1 ; finally get our divisor
; ----------------------- Start Calculating Polynomial - Taylor Series of Atan
pshufd XMM0 XMM0 {SHUFFLE 1,0,1,0} ; we are simply coying from LowQuadrord to Hi Quaddword = PSHUFD XMM0 XMM0 044h
mulsd xmm0 xmm0 ; the lowquadword will be a power of our divisor Y^4
movsd xmm1 xmm0 ; Copy the double Y to xmm1
; 1St Step. Calculate C = (0.1108941714238824 Y^2 - 0.1428569642436445)*Y^2
mulsd xmm1 X$Atan2_FactorA9
addsd xmm1 X$Atan2PolynomialFactor1A
mulsd xmm1 xmm0
; 2nd step. Calculate B = Y^2*(C+0.1999999999420338)
addsd xmm1 X$Atan2PolynomialFactor2A
mulsd xmm1 xmm0
; 3rd step. Calculate A = Y^2*(B - 0.3333333333333274)
addsd xmm1 X$Atan2PolynomialFactor3A
mulsd xmm1 xmm0
; 4th step. Calculate Y*(A+1)
pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; swap Y <-- Y^2 in xmm0 = PSHUFD XMM0 XMM0 78 (decimal)
addsd xmm1 X$Atan2_Float_One
mulsd xmm0 xmm1 ; result in xmm0
mov eax D@PosInTable
addsd xmm0 X$TangentTbl+eax*8+TangenTbl.MainValueDis
addsd xmm0 X$TangentTbl+eax*8+TangenTbl.RemainderValueDis
; Now we can safelly apply our mask TestingAtan2Data1
xorpd xmm0 X$Atan2_SGN_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8+8
.Else_If D@StartBlockIndexY <= 942
movsd xmm2 X$Atan2ClosetoZeroIndex ;esi+TangenTbl.TangPosDis = 0.029296875 (Fixed value when the value is closer to zero or near Pi)
; got our true pointer
; No need for index here. We need only to divide x/y
; we already have Y in xmm0, so, let´s put X on xmm1
movups xmm1 X$Atan2_Float_One
mov eax D@pX | movupd xmm1 X$eax ; Get X and copy it to xmm0
SSE_ABS_REAL8 xmm0 ; Turn Y in Positive
SSE_ABS_REAL8 xmm1 ; Turn X in Positive
divsd xmm0 xmm1
; get our masked index
unpcklpd xmm0 xmm1 | movmskpd ecx xmm0 | add ecx ecx; | mov D@MaskedIndex eax
; ----------------------- Start Calculating Polynomial - Taylor Series of Atan
pshufd XMM0 XMM0 {SHUFFLE 1,0,1,0} ; we are simply coying from LowQuadrord to Hi Quaddword
mulsd xmm0 xmm0 ; the lowquadword will be a power of our divisor Y^4
; -9.05387052965142591e-4 8.19725715676905915381763684962193281e-7
movsd xmm1 xmm0 ; Copy the double Y to xmm1
; 1St Step. Calculate C = (0.1108941714238824 Y^2 - 0.1428569642436445)*Y^2
mulsd xmm1 X$Atan2_FactorA9 | addsd xmm1 X$Atan2PolynomialFactor1A | mulsd xmm1 xmm0
; 2nd step. Calculate B = Y^2*(C+0.1999999999420338)
addsd xmm1 X$Atan2PolynomialFactor2A | mulsd xmm1 xmm0
; 3rd step. Calculate A = Y^2*(B - 0.3333333333333274)
addsd xmm1 X$Atan2PolynomialFactor3A | mulsd xmm1 xmm0
; 4th step. Calculate Y*(A+1)
pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; swap Y <-- Y^2 in xmm0
addsd xmm1 X$Atan2_Float_One | mulsd xmm0 xmm1 ; result in xmm0
xorpd xmm0 X$Atan2_SGN_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8+8
.Else ; Big_or_Small
call Sse2_atan2_Big_or_Small
.End_If
EndP
Hi Guga,
I'm a bit lost - what do you mean with "JJ´s Binary search routine"?
Quote from: jj2007 on February 20, 2022, 10:57:27 AM
Hi Guga,
I'm a bit lost - what do you mean with "JJ´s Binary search routine"?
Hi JJ
This one :azn: :azn:
http://masm32.com/board/index.php?topic=7659.msg83723#msg83723
You helped some time ago on a image routine i was testing.
Quote from: guga on February 20, 2022, 06:08:31 PM
Quote from: jj2007 on February 20, 2022, 10:57:27 AM
Hi Guga,
I'm a bit lost - what do you mean with "JJ´s Binary search routine"?
Hi JJ
This one :azn: :azn:
http://masm32.com/board/index.php?topic=7659.msg83723#msg83723
You helped some time ago on a image routine i was testing.
Yes, now I remember. It's called ArrayIndex (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1179) now.
Hi Guga!
Quote from: guga on February 20, 2022, 01:37:11 AM
Since it is using less registers and i removed some opcodes, it should be faster.
That is a bad assumption!
See instruccion timing (http://masm32.com/board/index.php?topic=9666.msg106081#msg106081).
Regards, HSE.
Quote from: HSE on February 21, 2022, 12:05:14 AM
Hi Guga!
Quote from: guga on February 20, 2022, 01:37:11 AM
Since it is using less registers and i removed some opcodes, it should be faster.
That is a bad assumption!
See instruccion timing (http://masm32.com/board/index.php?topic=9666.msg106081#msg106081).
Regards, HSE.
Indeed. You are right. Tks, HSE :thumbsup: I´ll need to stay with the original version instead. Or...another solution is use the LolRemez polynomials to speed it up a bit, but i´m not sure about accuracy yet. I´ll give a try converting aa app made in FreeBasic that do this LolRemez polynomials to be studied. The only problem is that i´ll also need to try to create a dll for FreeBasic to this thing works properly (Ouch.). If i see that the FreeBasic converstion will take too many time, i´ll give up and try to create a Lolremez app by scratch to test the accuracy for the algorithm. Btw..a LolRemez version for masm (or RosAsm, Fasm, nasm etc) is utterly needed for we test other ways to create faster routines that uses complex math.
Quote from: guga on February 23, 2022, 01:00:03 PMcreate faster routines that uses complex math.
If you give me examples what you need, I can try with FastMath() (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1191).
Hu JJ
I don´t remember right now any working example. I´ll take a look later, ok ? The goal is create a tool that allow us to do complex calculations on a faster way using this lolremez technique (or other) that can result on approximation values, like the ones we discussed some time ago with Siekmansk at: http://masm32.com/board/index.php?topic=8734.msg95511#msg95511
So, find the polynomials values for Exp(x) = Exp2(x / Logn(2.0)) etc or more complex maths like:
(x^2/log(y)) + sin(x)*atan2(y) etc etc
With lolremez we can produce things like:
lolremez -d 10 -r 0:1 "2**x" "2**x"
// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
double f(double x)
{
double u = 9.9522144894077596e-9;
u = u * x + 9.4609455637620191e-8;
u = u * x + 1.331271998884894e-6;
u = u * x + 1.5244659656760603e-5;
u = u * x + 1.5403957490414841e-4;
u = u * x + 1.3333543730974749e-3;
u = u * x + 9.6181294091749148e-3;
u = u * x + 5.5504108628244985e-2;
u = u * x + 2.402265069613608e-1;
u = u * x + 6.9314718055989127e-1;
return u * x + 1.0000000000000002;
}
.const
Chebylog2E real4 4 dup (1.44269504089) ; 1/Logn(2.0)
Cheby10Exp0 real4 4 dup (9.9522144894077596e-9)
Cheby10Exp1 real4 4 dup (9.4609455637620191e-8)
Cheby10Exp2 real4 4 dup (1.331271998884894e-6)
Cheby10Exp3 real4 4 dup (1.5244659656760603e-5)
Cheby10Exp4 real4 4 dup (1.5403957490414841e-4)
Cheby10Exp5 real4 4 dup (1.3333543730974749e-3)
Cheby10Exp6 real4 4 dup (9.6181294091749148e-3)
Cheby10Exp7 real4 4 dup (5.5504108628244985e-2)
Cheby10Exp8 real4 4 dup (2.402265069613608e-1)
Cheby10Exp9 real4 4 dup (6.9314718055989127e-1)
Cheby10Exp10 real4 4 dup (1.0000000000000002)
.code
align 16
SSE2_Exp_10: ; in: xmm0, out: xmm1
movaps xmm2,oword ptr Chebylog2E
mulps xmm2,xmm0
psrld xmm0,31
cvttps2dq xmm1,xmm2
psubd xmm1,xmm0
movdqa xmm0,xmm1
cvtdq2ps xmm1,xmm1
subps xmm2,xmm1
movaps xmm1,oword ptr Cheby10Exp0
pslld xmm0,23
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp1
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp2
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp3
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp4
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp5
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp6
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp7
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp8
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp9
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp10
paddd xmm0,xmm1
ret
So, a app where we can input a math formula and result the full code with the polynomials already computed can be very handy
If you can bring (x^2/log(y)) + sin(x)*atan2(y) into the form y=f(x), I can help you (I'm not good at math)
Let´s try it on a more simple function , then.
Try this:
sin(x)+cos(x)
which is the same as:
sqrt(2) * sin(x + Pi/4)
On this example, LolRemez app returned:
G:\HD1_Fino\BackupHD IDE - ok\RosAsm\RosAsm\Development\Dlls\New\LolRemezOld>lolremez.exe sin(x)+cos(x)
// Approximation of f(x) = sin(x)+cos(x)
// on interval [ -1, 1 ]
// with a polynomial of degree 4.
// p(x)=(((3.9297499716796932e-2*x-1.5653249505225839e-1)*x-4.9891262416895649e-1)*x+9.9750048049840423e-1)*x+9.9991743032029927e-1
double f(double x)
{
double u = 3.9297499716796932e-2;
u = u * x + -1.5653249505225839e-1;
u = u * x + -4.9891262416895649e-1;
u = u * x + 9.9750048049840423e-1;
return u * x + 9.9991743032029927e-1;
}
G:\HD1_Fino\BackupHD IDE - ok\RosAsm\RosAsm\Development\Dlls\New\LolRemezOld>pause
Press any key to continue . . .
Quote from: guga on March 02, 2022, 09:09:07 AM
Try this:
sin(x)+cos(x)
FastMath SinPlusCos
For_ ct=-1 To 360 ; max 10,000 iterations
fild ct ; X
fld st
fstp REAL10 ptr [edi]
fmul FP8(0.01745329252) ; degrees->rad
fsincos
fadd
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
FastMath
NameA equ SinPlusCos FastMath
TestA proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
xor ecx, ecx
align 4
.Repeat
void SinPlusCos(ecx) ; uses FastMath
fstp res8
inc ecx
.Until ecx>=360
ret
TestA endp
TestA_endp:
align_64
TestB_s:
NameB equ SinPlusCos Fpu
TestB proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
xor ecx, ecx
align 4
.Repeat
push ecx
fild stack
pop ecx
fmul FP8(0.01745329252) ; degrees->rad
fsincos ; uses the FPU
fadd
fstp res8
inc ecx
.Until ecx>=360
ret
TestB endp
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6745 cycles for 100 * SinPlusCos FastMath
39833 cycles for 100 * SinPlusCos Fpu
6797 cycles for 100 * SinPlusCos FastMath
39854 cycles for 100 * SinPlusCos Fpu
6741 cycles for 100 * SinPlusCos FastMath
39812 cycles for 100 * SinPlusCos Fpu
6745 cycles for 100 * SinPlusCos FastMath
39815 cycles for 100 * SinPlusCos Fpu
186 bytes for SinPlusCos FastMath
27 bytes for SinPlusCos Fpu
Real8 0.9823952887398167411 SinPlusCos FastMath
Real8 0.9823952887398167411 SinPlusCos Fpu
exact 0.9823952887191077263
Now the question is, of course, how fast and how accurate is LolRemez :rolleyes:
Great work. But still uses a precalculate table, right ?
About Lolremez, i´m not sure how accurate it is. I didn´t tested the results on the example i gave. For what i saw so far, the accuracy depends on the amount of polynomials to be calculated. For example, using a 4th degree polyomial is less accurate than using a 9th degree. If we could build a routine that can be able to do the maths of those equations using the polynomials approximations (ChebyChev, LolRemez, MinMax etc etc), we can achieve a higher level of accuracy using more polynomials (and yet, this shouldn´t affect performance since we can do it in SSE2 using few instructions, without any loops).
This library seems to have a more accurate results, but didn´t tested yet.
https://www.boost.org/doc/libs/1_37_0/libs/math/doc/sf_and_dist/html/math_toolkit/using_udt/use_ntl.html
Or this one..seems to be even more accurated
https://www.boost.org/doc/libs/1_37_0/libs/math/doc/sf_and_dist/html/math_toolkit/backgrounders/lanczos.html
I thought you had the FreeBasic DLL for testing LolRemez... :cool:
It would be fun to add it to the SinPlusCos.zip testbed
Quote from: jj2007 on March 02, 2022, 10:19:45 AM
I thought you had the FreeBasic DLL for testing LolRemez... :cool:
It would be fun to add it to the testbed
Yes, this is what i´m trying to do. Im quite finishing rebuilding Fb LolRemez to test. The main problem is that i got stuck today on a routine called "GaussJordan". I´m trying to fix it before go further on the porting.
I've tested it now, see above, SinPlusCosV2.zip
It's fast but not very accurate. I wonder how fast it is with the same accuracy as the fpu or FastMath (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1482). Also, -1 ... +1 is not the full 0...360 degrees range.
guga, if you are interested I can post a rational minimax approximation of arctan in the interval -1 to 1 with max error 1.34e-16
Quote from: jj2007 on March 02, 2022, 10:50:32 AM
I've tested it now, see above, SinPlusCosV2.zip
It's fast but not very accurate. I wonder how fast it is with the same accuracy as the fpu or FastMath (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1482). Also, -1 ... +1 is not the full 0...360 degrees range.
Hi JJ. I saw it. Indeed, it´s very fast, but lack precision (probably because it is using a 4th degree polynomial and not 9th degree). I´ll make further tests on it to se if Fb version is more accurate and how can we improve it more.
Quote from: jack on March 02, 2022, 11:17:25 AM
guga, if you are interested I can post a rational minimax approximation of arctan in the interval -1 to 1 with max error 1.34e-16
Hi Jack, yes, please. :) Also, post how you achieved such result :) It would be nice trying o implement a app using better approximation technique.
Hi JJ
try with these, please:
double f(double x)
{
double u = 2.681518169196906e-6;
u = u * x + 2.4121467561738585e-5;
u = u * x + -1.9833112700127753e-4;
u = u * x + -1.3882965671023472e-3;
u = u * x + 8.3332934231267097e-3;
u = u * x + 4.1666455708179865e-2;
u = u * x + -1.6666665852866353e-1;
u = u * x + -4.9999997368823986e-1;
u = u * x + 9.9999999952226541e-1;
return u * x + 9.999999994749521e-1;
}
I tried to insert it on masmbasic, but an error ocurred
Hi Gustavo,
Here are the results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2241 cycles for 100 * SinPlusCos FastMath
12007 cycles for 100 * SinPlusCos Fpu
950 cycles for 100 * LolRemez 5
2329 cycles for 100 * LolRemez 8
2106 cycles for 100 * SinPlusCos FastMath
11833 cycles for 100 * SinPlusCos Fpu
950 cycles for 100 * LolRemez 5
2329 cycles for 100 * LolRemez 8
2108 cycles for 100 * SinPlusCos FastMath
11833 cycles for 100 * SinPlusCos Fpu
949 cycles for 100 * LolRemez 5
2330 cycles for 100 * LolRemez 8
2106 cycles for 100 * SinPlusCos FastMath
11832 cycles for 100 * SinPlusCos Fpu
950 cycles for 100 * LolRemez 5
2330 cycles for 100 * LolRemez 8
183 bytes for SinPlusCos FastMath
24 bytes for SinPlusCos Fpu
72 bytes for LolRemez 5
124 bytes for LolRemez 8
Real8 1.383309602960451023 SinPlusCos FastMath
Real8 1.383309602960451023 SinPlusCos Fpu
Real8 1.382865190370866637 LolRemez 5
Real8 0.3833096037593586303 LolRemez 8
exact 1.383309602960451112
There is one tiny bit wrong, it seems. With 8 coefficients, it becomes difficult to keep track... see TestD :sad:
LolRemez 8 is slightly slower than FastMath, and it is very limited in its range. You have chosen -1 ... +1, which translates to degrees would be -57...58 degrees. Can you give me the coefficients for -PI ... +PI? That is FastMath's range in the attached version 3 (but only -57...+58 are used, in order to yield the same result as LolRemez).
Re precision: FastMath yields 16 digits of precision, compared to the value that I get with the Windows Calculator.
That´s weird.
I thought that giving a range of -1 to 1 mean s that we are dealing with -Pi to Pi covering the full range, but using normalized values. I´ll take a look later to see how exactly this lolremez works
https://github.com/samhocevar/lolremez
I found some other interesting articles about LolRemez tests here:
http://a-hackers-craic.blogspot.com/2014/07/a-better-faster-sincoscexpi.html
and
https://github.com/samhocevar/lolremez
https://asmbits.blogspot.com/2016/03/vector-sin-and-cos-sse2.html
guga, sorry for being late but just in case you want to give these a try
(x *(8883.0849736588075911 +
x^2 *(20222.7394253468435665 +
x^2 *(16159.2278648857071736 +
x^2 *(5387.4278783915353868 +
x^2 *(686.94856409857219691 +
22.393048501019117472 *
x^2))))))/(8883.0849736588087835 +
x^2 *(23183.7677498992921884 +
x^2 *(22110.5334534862024463 +
x^2 *(9389.8642844889571880 +
x^2 *(1719.75402628941589287 +
x^2 *(107.898060132482141870 + x^2))))))
I used Mathematica with the following steps
first load the package
<< FunctionApproximations`
then approximate the function
f = MiniMaxApproximation[ ArcTan[Sqrt[x]]/Sqrt[x], {x, {1/10^50, 1}, 5, 6}, WorkingPrecision -> 80, Brake -> {500, 5}]
and extract the polynomial from the resulting list
f = f[[2, 1]]
the reason for approximating Arctan(Sqrt(x))/Sqrt(x) is to eliminate the odd polynomial so instead of
x - x^3/3 + x^5/5 ... you have 1 - x/3 + x^2/5 - x^3/7 and so the approximating polynomial won't have even powers of x with small coefficients but now we need to convert the polynomial obtained back to odd and at the same time reduce the digits of the polynomial
f = N[(f /. x -> x*x)*x, 22]
and convert the polynomial to HornerForm
HornerForm[f]
guga, you could reduce the argument as follows, if x=Pi*2 then arctan(x) = 2*arctan(x/(1+sqrt(1+x^2))) so instead of arctan(6.28...) you calculate 2 * arctan(.853...)
reducing the interval to less than 0..1 for the minimax approximation doesn't payoff
if you reduce the argument x as follows ( x in the range -Pi*2 .. Pi*2 )
for i=1 to 5
x=x/(1+sqrt(1+x*x))
next i
then you could calculate the arctan using this polynomial
y = 32*((x * (202.153477236938512481 +
x^2 * (305.19168618485308633 +
x^2 * (127.474551705012762179 +
12.6747296167999257198 * x^2))))/(202.153477236938513333 +
x^2 * (372.576178597164761655 +
x^2 * (211.235915790272755720 +
x^2 * (37.4505339820668783381 + x^2)))))
or you could simply use the power series
y = 32*(x-x^3/3+x^5/5-x^7/7+x^9/9-x^11/11)
or in HornerForm
y = 32*(x*(1+x^2*(-(1/3)+x^2*(1/5+x^2*(-(1/7)+x^2*(1/9-x^2/11))))))
or
y = x *(32 + x^2*(-(32/3) + x^2*(32/5 + x^2*(-(32/7) + x^2*(32/9 - (32*x^2)/11)))))
Quote from: guga on March 02, 2022, 04:50:01 PMI thought that giving a range of -1 to 1 mean s that we are dealing with -Pi to Pi covering the full range, but using normalized values.
Nope. If they were normalised, I would not get the correct values...
New version for sin+cos, this is now the code to calculate y(x):
PushPoly ecx*0.01745329251994329577,\
3.9297499716796932e-2,\
-1.5653249505225839e-1,\
-4.9891262416895649e-1,\
9.9750048049840423e-1,\
9.9991743032029927e-1
fstp res8
The *0.017 translates degrees to rad. The two LolRemez functions operate over the limit range of -57...+58 degress. If you give me the coefficients for the full -180...+180 or 0...360 degrees range, I can adapt them accordingly.
Timings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2256 cycles for 100 * SinPlusCos FastMath
11833 cycles for 100 * SinPlusCos Fpu
1308 cycles for 100 * LolRemez 5
3258 cycles for 100 * LolRemez 8
2106 cycles for 100 * SinPlusCos FastMath
12328 cycles for 100 * SinPlusCos Fpu
1309 cycles for 100 * LolRemez 5
3177 cycles for 100 * LolRemez 8
183 bytes for SinPlusCos FastMath
24 bytes for SinPlusCos Fpu
72 bytes for LolRemez 5
122 bytes for LolRemez 8
Real8 1.383309602960451023 SinPlusCos FastMath
Real8 1.383309602960451023 SinPlusCos Fpu
Real8 1.382865190370866637 LolRemez 5
Real8 1.383309603234310847 LolRemez 8
exact 1.383309602960451112
Hi Guys
JJ, i don´t know yet how exactly the ranges works on the original lolremez version.
You can give a shot with the attached file or on it´s documentation here: https://github.com/samhocevar/lolremez
About Fb itself, i´m finishing to port, but i must confess, this is one of the most painful things i ever saw before. FB inserts tons of useless code during compiling and the code is so bad organized that it´s really really hard to port or even understand the code flow.
For example, to put a simple line on the console, FB code, seems to be easy, fast and reliable, right ? Something at principle, simple as:
' puts (String(Loword(Width)*Hiword(Width)," ")+Chr(10)) 'clear console
puts("_____________________________")
puts(!"Approxomating Polynomial\n")
sh=copy + " [" +str(lower) + " To " + str(upper) + "]"
Ok...but, that´s not true at all. Internally for inserting a simple line, FB does this:
; puts (String(Loword(Width)*Hiword(Width)," ")+Chr(10)) 'clear console
C_call 'msvcrt.puts' {B$ "_____________________________", 0}
C_call 'msvcrt.puts' {B$ "Approximating Polynomial", 0A, 0}
call 'FbRtl32.fb_StrAssign' HValue, 0-1, DValue, 0-1, 0
call 'FbRtl32.fb_StrConcatAssign' HValue, 0-1, {B$ " [", 0}, 4, 0
call 'FbRtl32.fb_DoubleToStr' D$Lower, D$Lower+4
call 'FbRtl32.fb_StrConcatAssign' HValue, 0-1, eax, 0-1, 0
call 'FbRtl32.fb_StrConcatAssign' HValue, 0-1, {B$ " To ", 0}, 5, 0
call 'FbRtl32.fb_DoubleToStr' D$Upper, D$Upper+4
call 'FbRtl32.fb_StrConcatAssign' HValue, 0-1, eax, 0-1, 0
call 'FbRtl32.fb_StrConcatAssign' HValue, 0-1, {B$ "]", 0}, 2, 0
C_call 'msvcrt.puts' D$HValue.pStringData
When, in fact, all of the above code, each function goes on several windows api (msvcrt), instead simply doing it in 2 lines of code (In fact, using only 2 of these apis: sprintf and puts) like this:
[Sz_PolyminalIntro: B$ "_____________________________
Approximating Polynomial
%s [%.16g To %.16g]", 0]
[Sz_OutputStrBuff: B$ 0 #512]
C_call 'msvcrt.sprintf' Sz_OutputStrBuff, Sz_PolyminalIntro, D$DValue.pStringData, D$Lower, D$Lower+4, D$Upper, D$Upper+4
C_call 'msvcrt.puts' Sz_OutputStrBuff
Again, the whole idea is not bad, but the internal organization is terrible. On the other hand, BCX for example was used to produce a way cleaner code. (Don´t know how is it doing right now, because didn´t had time o test, but as far i remember, BCX was way more robust then FB - at least in terms of internal organization)
Anyway..i´ll continue to port this, and will try to finish today or tomorrow. Keep in mind, that, if FB did that mess on a single string outputted to a console, you may get an idea of what a hell i´m trying to fix on the rest of the code :biggrin: :biggrin: :biggrin: :biggrin:
Quote from: guga on March 03, 2022, 05:27:22 AMJJ, i don´t know yet how exactly the ranges works on the original lolremez version.
It's the
-r parameter:
lolremez --double -d 9 -r "-pi:pi" "sin(x)+cos(x)"
pause
QuoteYou can give a shot with the attached file
Here are the timings: LolRemez is faster if you don't need precision. Note these are results over the full -180...+180 degrees range:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6476 cycles for 100 * SinPlusCos FastMath
39649 cycles for 100 * SinPlusCos Fpu
5289 cycles for 100 * LolRemez 6
6341 cycles for 100 * LolRemez 7
7350 cycles for 100 * LolRemez 8
10189 cycles for 100 * LolRemez 10
6480 cycles for 100 * SinPlusCos FastMath
39487 cycles for 100 * SinPlusCos Fpu
5296 cycles for 100 * LolRemez 6
6334 cycles for 100 * LolRemez 7
7378 cycles for 100 * LolRemez 8
10195 cycles for 100 * LolRemez 10
75 bytes for SinPlusCos FastMath
33 bytes for SinPlusCos Fpu
83 bytes for LolRemez 6
93 bytes for LolRemez 7
103 bytes for LolRemez 8
123 bytes for LolRemez 10
exact -0.9823952887191077263
Real8 -0.9823952887191077510 SinPlusCos FastMath
Real8 -0.9823952887191077510 SinPlusCos Fpu
Real8 -0.9580416352510443546 LolRemez 6
Real8 -0.9768328662750306313 LolRemez 7
Real8 -0.9833290686311381146 LolRemez 8
Real8 -0.9823756874760426472 LolRemez 10
Hi JJ. So, on this example it´s not totally usefull. I´m finishing the app to we test later, but ...in meanwhile i found some interesting material that can improve accuracy using something called Caratheodory-Fejer approximation. This method seems to use the Chebyshev coefficients to decrease the error.
it´s way out of my head right now. But, these articles are the kind of material that Siekmanski and Jack would like :azn: :azn: :azn:
https://encyclopediaofmath.org/wiki/Carath%C3%A9odory-Fej%C3%A9r_problem
https://www.embeddedrelated.com/showarticle/152.php
https://www.chebfun.org/docs/guide/guide04.html
https://www.mathworks.com/matlabcentral/fileexchange/22055-caratheodory-fejer-approximation
https://books.google.com.br/books?id=xsnBDwAAQBAJ&pg=PA159&lpg=PA159&dq=Caratheodory-Fejer+approximation,&source=bl&ots=uIKjZvoFNT&sig=ACfU3U1fb_hKZ9FRvyqG7b5y5KO4pCWsnA&hl=pt-BR&sa=X&ved=2ahUKEwiRn8_X7Kj2AhXJr5UCHQ8tDVo4HhDoAXoECA8QAw#v=onepage&q=Caratheodory-Fejer%20approximation%2C&f=false
Quote from: guga on March 03, 2022, 01:11:56 PM
Hi JJ. So, on this example it´s not totally usefull.
FastMath (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1482) is a factor 6 faster than the FPU, and on par with a LolRemez with 7 coefficients. The latter deviates by half a percent from the true result; that is 9 pixels on a screen 1600 pixels wide... so it depends on your usage. IIRC you need it for colours, so that is 256*0.5%->+-0.7 for the LolRemez 7. I suppose you can live with that :smiley:
6476 cycles for 100 * SinPlusCos FastMath
39649 cycles for 100 * SinPlusCos Fpu
5289 cycles for 100 * LolRemez 6
6341 cycles for 100 * LolRemez 7
...
exact -0.9823952887191077263
Real8 -0.9823952887191077510 SinPlusCos FastMath
Real8 -0.9823952887191077510 SinPlusCos Fpu
Real8 -0.9580416352510443546 LolRemez 6
Real8 -0.9768328662750306313 LolRemez 7
Hi JJ
Agree when you say that for colors, it may not be that precise, but for other needs it maybe necessary something that grants more precision. I´m quite finishing the Fb version and will release it soon to we test both the Dodicat´s approach and the FB runtime dlls we created :thumbsup: :thumbsup: :thumbsup:
This is the app fully working on the same way as the original. I´ll clean up the code and will go to the fix of the export (savings) now.
One question, i plan to export the result also in Masm and RosAsm syntax, so we can use this thing if needed, but i don´t know how to create multiline comments in masm.
In Rosasm when i want a multiline comment, insert it all in between double semicolon in between the text. Like:
;;
This is a comment
;;
How i do it in masm ? What is the syntax that allow me to do multi-line comments in masm ?
(https://i.ibb.co/vdnt4q5/vszx-Untitled.jpg) (https://ibb.co/HP8JT4v)
Standard is:
comment *
This is a comment
*
Can be whatever character not used in commented lines.
To comment code:if 0
commented code
endif
because you can uncomment:if 1
commented code
endif
Quote from: HSE on March 04, 2022, 06:09:56 AMTo comment code:if 0
commented code
endif
That's what I use (169 occurrences in the RichMasm source...)
For comparing several variants, I often use
if 0
... code version A ...
elseif 1
... code version B ... (will be used in this example)
else
... code version C ...
endif
Hi,
There is the "COMMENT" directive.
COMMENT !
That works as [COMMENT Symbol] and when the symbol
next appears, it ends the comments section. (With the
following line as code, I think.)
!
; This is now normal code.
MOV EAX,-1
Cheers,
Steve N.
Quote from: FORTRANS on March 04, 2022, 08:19:22 AM
Hi,
There is the "COMMENT" directive.
COMMENT !
That works as [COMMENT Symbol] and when the symbol
next appears, it ends the comments section. (With the
following line as code, I think.)
!
; This is now normal code.
MOV EAX,-1
Cheers,
Steve N.
thanks Steve :thumbsup:
makes it easier to translate C style
/*
multiline comment
*/
How are you calculating the "lolremez" AKA chebyshev coefficients? Are you using the formula or solving the simultaneous equations? I got stuck and gave up ages ago.
See Reply #61