The MASM Forum
General => The Laboratory => Topic started by: jj2007 on December 21, 2020, 11:16:57 AM

Two algos that calculate arcsin(x) in the range x=0 ... 0.5. The first one, Arcsinus(), uses Raymond's tutorial (http://www.ray.masmcode.com/tutorial/fpuchap10.htm), the second algo uses FastMath with Arcsinus() values:
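FastMath ArcSin ; define a math function
For_ fct=0.0 To 1.0 Step 0.0001 ; sample x across the table range
fld fct
fstp REAL10 ptr [edi] ; store x
void Arcsinus(fct) ; slow, exact calculation; result is left in ST(0)
fstp REAL10 ptr [edi+REAL10] ; store y = Arcsinus(x)
add edi, 2*REAL10 ; advance to the next x/y pair
Next
FastMath ; end of the definition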
May I have some timings please?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
15296 cycles for 100 * Arcsinus
1907 cycles for 100 * ArcSin
15373 cycles for 100 * Arcsinus
1899 cycles for 100 * ArcSin
15238 cycles for 100 * Arcsinus
1912 cycles for 100 * ArcSin
15206 cycles for 100 * Arcsinus
1910 cycles for 100 * ArcSin
15219 cycles for 100 * Arcsinus
1905 cycles for 100 * ArcSin
58 bytes for Arcsinus
209 bytes for ArcSin
Real8 29.99999926061033761 Arcsinus
Real8 29.99999926061034117 ArcSin
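
For readers who do not use MasmBasic, the idea behind the table-building loop in the opening post can be sketched in Python. This is only an illustration under assumptions: the names STEP, TABLE and arcsin_lut are made up here, results are in degrees as in the Real8 lines above, and plain linear interpolation is swapped in for whatever FastMath does internally.

```python
import math

STEP = 0.0001              # same step as the For_ loop in the opening post
N = round(1.0 / STEP)      # number of intervals covering 0.0 ... 1.0

# Precompute arcsin(x) in degrees at every grid point (the slow part, done once).
TABLE = [math.degrees(math.asin(min(i * STEP, 1.0))) for i in range(N + 1)]

def arcsin_lut(x):
    """Estimate arcsin(x) in degrees for 0 <= x <= 1 by linear interpolation."""
    pos = x / STEP
    i = min(int(pos), N - 1)   # index of the grid point at or just below x
    frac = pos - i             # where x sits between grid points i and i+1
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])

print(arcsin_lut(0.5))         # close to 30 degrees
```

Each lookup then costs one index calculation and one multiply-add instead of a full arcsine evaluation, which is the kind of trade the cycle counts above are measuring.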

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
18643 cycles for 100 * Arcsinus
2256 cycles for 100 * ArcSin
18636 cycles for 100 * Arcsinus
2246 cycles for 100 * ArcSin
18650 cycles for 100 * Arcsinus
2261 cycles for 100 * ArcSin
18629 cycles for 100 * Arcsinus
2255 cycles for 100 * ArcSin
18640 cycles for 100 * Arcsinus
2261 cycles for 100 * ArcSin
58 bytes for Arcsinus
209 bytes for ArcSin
Real8 29.99999926061033761 Arcsinus
Real8 29.99999926061034117 ArcSin
 ok 

Thanks, Marinus :thup:
I wonder why my old i5 is a tick faster... doesn't make much sense :cool:
Core i5-2450M (https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i52450M+%40+2.50GHz&id=800)
Core i7-4930K (https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i74930K+%40+3.40GHz&id=2023)

Hi Jochen,
My system is clocked down to prevent noise, for live audio recordings.

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
16580 cycles for 100 * Arcsinus
1378 cycles for 100 * ArcSin
16408 cycles for 100 * Arcsinus
1378 cycles for 100 * ArcSin
16397 cycles for 100 * Arcsinus
1358 cycles for 100 * ArcSin
16463 cycles for 100 * Arcsinus
1368 cycles for 100 * ArcSin
16674 cycles for 100 * Arcsinus
1358 cycles for 100 * ArcSin
58 bytes for Arcsinus
209 bytes for ArcSin
Real8 29.99999926061033761 Arcsinus
Real8 29.99999926061034117 ArcSin

I also wonder how the old-school raycasting optimization, an arccos LUT, stands compared to this? :biggrin:
I have a general SSE trig PROC that's untested: you just input 4 floats and an offset value that controls which set of constants it points to, so it becomes a different Taylor series.
@Marinus
I thought you underclocked it because it would be a bigger challenge to optimize code to run on a slower CPU :badgrin:
An AMIGA clock speed would be really challenging today :biggrin:

@Magnus
Still miss the Amiga days, banging directly to the hardware was a lot of fun.

:biggrin: What is ArcSinInit's timing?

Test yourself, it might be enough for a coffee break, who knows? :biggrin:
ArcSinInit:
NanoTimer()
FastMath ArcSin ; define a math function
For_ fct=0.0 To 1.0 Step 0.0001
fld fct
fstp REAL10 ptr [edi]
void Arcsinus(fct)
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
FastMath
PrintLine NanoTimer$(), " for initialising the ArcSin macro"
retn

Test yourself, it might be enough for a coffee break, who knows? :biggrin:
I tried previously, but I got a little crash related to memory allocation :biggrin:

Post the exe, I am curious

Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz (SSE4)
18942 cycles for 100 * Arcsinus
2095 cycles for 100 * ArcSin
18950 cycles for 100 * Arcsinus
2100 cycles for 100 * ArcSin
19068 cycles for 100 * Arcsinus
2137 cycles for 100 * ArcSin
18970 cycles for 100 * Arcsinus
2596 cycles for 100 * ArcSin
18904 cycles for 100 * Arcsinus
2112 cycles for 100 * ArcSin
58 bytes for Arcsinus
209 bytes for ArcSin
Real8 29.99999926061033761 Arcsinus
Real8 29.99999926061034117 ArcSin
 ok 

AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
19735 cycles for 100 * Arcsinus
2530 cycles for 100 * ArcSin
19817 cycles for 100 * Arcsinus
2535 cycles for 100 * ArcSin
19796 cycles for 100 * Arcsinus
2309 cycles for 100 * ArcSin
19779 cycles for 100 * Arcsinus
2326 cycles for 100 * ArcSin
19822 cycles for 100 * Arcsinus
2321 cycles for 100 * ArcSin
58 bytes for Arcsinus
209 bytes for ArcSin
Real8 29.99999926061033761 Arcsinus
Real8 29.99999926061034117 ArcSin


Post the exe, I am curious
:thumbsup: The MbProHeap initialization was missing:
ifdef MbBufferInit
call MbBufferInit
endif

AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G (SSE4)
1.60 GHz
28377 cycles for 100 * Arcsinus
4369 cycles for 100 * ArcSin
28350 cycles for 100 * Arcsinus
4247 cycles for 100 * ArcSin
28358 cycles for 100 * Arcsinus
4414 cycles for 100 * ArcSin
28341 cycles for 100 * Arcsinus
4440 cycles for 100 * ArcSin
28374 cycles for 100 * Arcsinus
4300 cycles for 100 * ArcSin
58 bytes for Arcsinus
209 bytes for ArcSin
Real8 29.99999926061033761 Arcsinus
Real8 29.99999926061034117 ArcSin
Windows 7 Pro, 32 bit

Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz (SSE4)
14448 cycles for 100 * Arcsinus
1225 cycles for 100 * ArcSin
14644 cycles for 100 * Arcsinus
1214 cycles for 100 * ArcSin
14663 cycles for 100 * Arcsinus
1237 cycles for 100 * ArcSin
14606 cycles for 100 * Arcsinus
1206 cycles for 100 * ArcSin
14518 cycles for 100 * Arcsinus
1272 cycles for 100 * ArcSin
58 bytes for Arcsinus
209 bytes for ArcSin
Real8 29.99999926061033761 Arcsinus
Real8 29.99999926061034117 ArcSin

:thumbsup: The MbProHeap initialization was missing:
ifdef MbBufferInit
call MbBufferInit
endif
Did you call ArcSinInit before ShowCpu? Normally the Init macro takes care of all that, but my timings template keeps the option open to run it without MasmBasic, so I chose the ifdef MbBufferInit instead.

Essentially it is the code you posted. The MasmBasic book is not so clear :biggrin:
BTW, the FastMath name sounds nice, but it is a little misleading. Perhaps MathLUT or something like that, because it is a lookup-table creation macro, not a fast calculation. :thumbsup:

If it gives you a factor 8-12 faster math, then the name is not that important ;)

originally made for Guga's color conversions
SIMD taylor version
I love SIMD
0.234714
0.61685
0.380504
0.234714
0.144784
2!,4!,6!,8!
0.5
0.0416667
0.00138889
2.48016e-05
cosine result :0.707426
sine result :0.707107
arcsine result:0.900242
times x:1000000
clock cycles :55624547
cycles/loop :55
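
The listing above looks like the even powers of x = pi/4 (0.61685 is (pi/4)^2, 0.380504 is (pi/4)^4, and so on) followed by the reciprocals of 2!, 4!, 6! and 8!. Assuming that reading is right, the cosine part of such a Taylor evaluation can be sketched in scalar Python; the names INV_FACT and cos_taylor are invented for this illustration, and the SIMD PROC itself is not shown in the thread.

```python
import math

# Reciprocals of 2!, 4!, 6!, 8!, matching the constants printed above.
INV_FACT = [1 / 2, 1 / 24, 1 / 720, 1 / 40320]

def cos_taylor(x):
    """cos(x) ~ 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! (good for small |x|)."""
    x2 = x * x
    power = 1.0    # running even power of x: x^2, x^4, x^6, x^8
    sign = -1.0    # the series alternates in sign, starting with minus
    total = 1.0
    for c in INV_FACT:
        power *= x2
        total += sign * c * power
        sign = -sign
    return total

print(cos_taylor(math.pi / 4), math.cos(math.pi / 4))
```

With all four terms and the alternating signs in place, the value at pi/4 agrees with the library cosine to better than 1e-6, so it is a handy reference when checking a hand-written SIMD version.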

If it gives you a factor 8-12 faster math
Not really.
I'm thinking of these equations (without accounting for precision):
 Time_of_building_per_access = Time_to_build_table/Number_of_accesses_per_program
 FactorJJ = Time_of_calculation/(Time_of_access + Time_of_building_per_access)
The indifference point is at around 25250 replaced calculations. At that point the factor is 1.
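
The two equations above can be turned into a small calculator. Setting the factor to 1 and solving for the number of accesses gives N = Time_to_build_table / (Time_of_calculation - Time_of_access). The sketch below assumes all times are in the same unit; the function names and the sample numbers are hypothetical and were not measured in this thread.

```python
def speedup_factor(t_calc, t_access, t_build, n_accesses):
    """Factor from the post above: direct calculation time over amortised table time."""
    return t_calc / (t_access + t_build / n_accesses)

def break_even_accesses(t_calc, t_access, t_build):
    """Number of accesses at which the factor is exactly 1."""
    if t_calc <= t_access:
        raise ValueError("the table never pays off if a lookup is not faster")
    return t_build / (t_calc - t_access)

# Hypothetical cycle counts, chosen only to illustrate the arithmetic.
n = break_even_accesses(t_calc=150.0, t_access=20.0, t_build=3_250_000.0)
print(n, speedup_factor(150.0, 20.0, 3_250_000.0, n))
```

Beyond the break-even count the factor keeps rising towards t_calc / t_access, which is the limit that the per-call cycle timings earlier in the thread report.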

It takes one millisecond to build the table, Héctor. The more complex the mathematical function is, the more you can gain with FastMath :cool:

The more complex the mathematical function is, the more you can gain with FastMath :cool:
Only if you are calling the table beyond the indifference point :biggrin:

It takes one millisecond to build the table, Héctor. The more complex the mathematical function is, the more you can gain with FastMath :cool:
But if you make the tables in a worker thread while the Windows main thread creates and loads lots of things at startup, you wouldn't notice the milliseconds it takes to make one or several tables

But if you make the tables in a worker thread while the Windows main thread creates and loads lots of things at startup, you wouldn't notice the milliseconds it takes to make one or several tables
Yes, you need even more time to prepare for the game :biggrin:
Tables like this are very useful in games because you don't need precision.
Only if you are calling the table beyond the indifference point :biggrin:
Indeed, you only get some profit after 50504 table accesses (almost double the indifference point)

Indeed, you only get some profit after 50504 table accesses (almost double the indifference point)
Ok, put Step 0.5 in ArcSinInit and do your optimisation again :badgrin:

Ok, put Step 0.02 in ArcSinInit and do your optimisation again :badgrin:
If you don't need any precision, you could just put any number as a solution; that could be even faster :biggrin:

If you don't need precision, check if it's still good enough with Step 0.1:
ArcSinInit:
NanoTimer()
FastMath ArcSin ; define a math function
For_ fct=0.0 To 1.0 Step 0.1
fld fct
fstp REAL10 ptr [edi]
void Arcsinus(fct)
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
FastMath
PrintLine NanoTimer$(), " for initialising the FastMath macro"
For_ fct=0.05 To 1.0 Step 0.1 ; compare the exact value with the estimate
PrintLine Str$("%3f\t", fct), Str$("%9f\t", Arcsinus(fct)v), Str$("%9f", ArcSin(fct)v)
Next
retn
Internally, FastMath uses SetPoly3 (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1124). It's a fairly sophisticated LUT :cool:

Hi JJ!
With step = 0.1 you can get an absolute error of 5.6 x 10^{0} (the relative error is 9.6% at that point). That is very big, especially compared with the usual default: absolute error = 1.0 x 10^{-6}.
So first you have to find the step size for the precision you need, and second you have to choose between direct calculation and a table, depending on the number of accesses to the solution.
Regards. HSE
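
One way to pick the step size is simply to measure the error for a few candidates. The sketch below is an assumption-laden stand-in: it uses plain linear interpolation between table points and probes only the midpoint of each interval, whereas FastMath is said to use SetPoly3, so the numbers it prints will not match the figures quoted in this thread. The names asin_deg and max_midpoint_error are made up for the example.

```python
import math

def asin_deg(x):
    """Exact arcsin in degrees, with x clamped to 0 ... 1 against rounding."""
    return math.degrees(math.asin(min(max(x, 0.0), 1.0)))

def max_midpoint_error(step):
    """Largest |estimate - exact| found at the midpoints of a table with this step."""
    n = round(1.0 / step)
    table = [asin_deg(i * step) for i in range(n + 1)]
    worst = 0.0
    for i in range(n):
        mid = (i + 0.5) * step
        estimate = 0.5 * (table[i] + table[i + 1])   # linear interpolation at the midpoint
        worst = max(worst, abs(estimate - asin_deg(mid)))
    return worst

for step in (0.1, 0.01, 0.001):
    print(step, max_midpoint_error(step))
```

The worst case always sits in the last interval before x = 1, where the arcsine curve becomes vertical, so a table that is fine over 0 ... 0.5 can still be poor near the end of the range.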

Right, 9.6% is quite big. Use 0.01 instead, the table gets created in less than 100 microseconds, and the error is 1% max.
If that is not precise enough, go for 0.001:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
202 µs for initialising the FastMath macro
0.0500 2.86598398 2.86598398
0.150 8.62692656 8.62692656
0.250 14.4775122 14.4775122
0.350 20.4873151 20.4873151
0.450 26.7436840 26.7436840
0.550 33.3670130 33.3670130
0.650 40.5416019 40.5416019
0.750 48.5903779 48.5903789
0.850 58.2116694 58.2116729
0.950 71.8051277 71.8051861
202 microseconds, or 0.2 milliseconds. Given that some of my "professional" software needs minutes to start, I start wondering what is the motivation behind your critical comments, Héctor :badgrin:

I start wondering what is the motivation behind your critical comments, Héctor :badgrin:
Just beating what I can. It's the laboratory:
Algorithm and code design research laboratory. This is the place to post assembler algorithms and code design for discussion, optimisation and any other improvements that can be made on it. Post code here to be beaten to death to make it better, smaller, faster or more powerful. Feel free to explain the optimisation methods used so that everyone can get a feel for the code design.
Tables are interesting tools, especially since your macro makes them so easy to create, but how to build them and when to use them deserves some consideration.

Well, if you use trig later for drawing in this 64-bit era, wouldn't a fixed-point table be a faster alternative? Even 32-bit code supports MUL, which returns 32 bits of the result in eax and 32 bits in edx. DIV also uses both 32-bit registers.

Well, if you use trig later for drawing in this 64-bit era, wouldn't a fixed-point table be a faster alternative? Even 32-bit code supports MUL, which returns 32 bits of the result in eax and 32 bits in edx. DIV also uses both 32-bit registers.
You can time that to be sure :biggrin:

It's best to do optimisation and timings in practical uses too, such as whole tunnel (stargate) or sphere (planet) code: float-to-int conversion not only takes some cycles, but mixing SSE floating-point code and SSE2 integer code also gets you some penalty.
For example, a circle can be anything from a 32x32 sprite to a big hi-res 4K HD screen: 32 diameter * pi vs 1080 diameter * pi. So a general 360-degree LUT works best for a 360-pixel circle; it has too many entries for a 32 diameter and too few for a 1080 diameter.
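
The circle argument above can be put into numbers. Assuming one table entry per pixel of circumference is the target (an assumption made for this sketch, as is the function name circle_points), the table size follows directly from diameter * pi:

```python
import math

def circle_points(diameter_px):
    """Angular steps needed so neighbouring points are about one pixel apart."""
    return math.ceil(math.pi * diameter_px)

for d in (32, 1080):
    print(d, circle_points(d))   # 32 -> 101 entries, 1080 -> 3393 entries
```

Under that assumption a fixed 360-entry table oversamples the 32-pixel sprite by more than a factor of three and undersamples the 1080-pixel circle by roughly a factor of nine, so sizing the table per use case may be worth the small setup cost.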

from new computer, xp 32 bit :biggrin:
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
17230 cycles for 100 * Arcsinus
2710 cycles for 100 * ArcSin
17212 cycles for 100 * Arcsinus
2726 cycles for 100 * ArcSin
17212 cycles for 100 * Arcsinus
2714 cycles for 100 * ArcSin
17219 cycles for 100 * Arcsinus
2710 cycles for 100 * ArcSin
17212 cycles for 100 * Arcsinus
2710 cycles for 100 * ArcSin
58 bytes for Arcsinus
209 bytes for ArcSin
Real8 29.99999926061033761 Arcsinus
Real8 29.99999926061034117 ArcSin
 ok 
When I say 'new', it's a misnomer. It's a refurbished HP 8100 Elite SFF box. For $130 USD, complete with monitor, keyboard and mouse, it's doing a terrific job so far. And drivers are still around for the good old XP. :tongue: