Hi guys,
Another test. This time it is a fast Log10 function, along the same lines as the fast exp we are testing in this thread: http://masm32.com/board/index.php?topic=8734.0
I succeeded in assembling the file with MasmBasic (it has small errors, but it is working as expected :mrgreen:). The function calculates the Log10 of a number with a precision of 16 digits after the ".".
It also checks for NaN, zero, and negative and positive infinity, and it handles denormalized values.
The result is returned in xmm0. (This time I succeeded in making it use fewer xmm registers :thumbsup: :thumbsup: :thumbsup: )
Important: if you assemble the function on its own, you must align the data on a 16-byte boundary, because the xmm instructions use 128-bit memory operands. (I used memory operands to keep the number of xmm registers down.)
Btw, please don't mind the labels in the console output. I don't know how to configure MasmBasic to output the proper text :bgrin: :bgrin: :bgrin:
Original value of Log10(5)
Log10(5) = 0.698970004336018804786261105275506973231810118537891458689
My version:
Log10(5) = 0.698970004336018857
On my PC the average is around 2500 cycles, but again, we can make it even faster with some extra effort. I can get it down to around 1600 cycles
(without losing precision) simply by removing the special-case check, i.e. by commenting out these lines:
cmp ecx, 2045 ; Check for NAN, Infinite, Zero and Denormalized values
jbe loc_4C6BFC
(...)
loc_4C6BFC:
But if we do so, we can no longer detect errors in the input number, nor calculate the log10 of denormalized values.
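For anyone wondering why a single compare is enough to catch all of those cases, here is a minimal Python sketch of the same bit test. It mirrors the `psrlq 52 / and 0FFF / sub 1` sequence from the listing followed by the unsigned compare against 2045; the helper names are mine and not part of the routine.

```python
import struct

def top12_minus_1(x: float) -> int:
    """Mirror psrlq xmm,52 | and ecx,0FFFh | sub ecx,1 as a 32-bit unsigned value."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    top12 = (bits >> 52) & 0xFFF       # sign bit plus the 11-bit biased exponent
    return (top12 - 1) & 0xFFFFFFFF    # wraps to 0FFFFFFFFh when top12 is 0

def needs_slow_path(x: float) -> bool:
    """True when the routine would leave the fast path (unsigned ecx > 2045)."""
    return top12_minus_1(x) > 2045

for value in (5.0, 0.0, 5e-324, -1.0, float("inf"), float("nan")):
    print(value, needs_slow_path(value))
```

Zero and denormals have a zero exponent field, so the subtraction wraps around; infinities and NaN have exponent 2047; negative numbers have the sign bit set. All of them end up above 2045, while every positive normal number stays at or below it.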
The parameters are the same as in the exp function:
1st parameter = the value to be calculated. It can be an integer, a Float, or the address of a Real8 (or qword).
2nd parameter = the flag that tells the function which type of input was passed. I built a set of equates for that:
SSE_EXP_INT equ 1
SSE_EXP_FLOAT equ 2
SSE_EXP_REAL8 equ 4
SSE_EXP_QWORD equ 8
So, if you want to calculate the log10 of 5 and the input is an integer, you must use
invoke Sse2_log10_precise, 5, 1
or
invoke Sse2_log10_precise, 5, SSE_EXP_INT
If the input is a Float (Real4) you use:
MyValue Real4 5.0 ; I believe this is the masm syntax for Float, right ?
invoke Sse2_log10_precise, [MyValue], 2
or
invoke Sse2_log10_precise, [MyValue], SSE_EXP_FLOAT
If the input is a Real8, you must pass the address (offset) of the value, like this:
MyValue Real8 5.0
invoke Sse2_log10_precise, offset MyValue, 4
or
MyValue Real8 5.0
invoke Sse2_log10_precise, offset MyValue, SSE_EXP_REAL8
The same applies to a QWord (int64): use 8 as the flag (2nd parameter) and pass the pointer to the int64 value as the 1st parameter.
Ex:
push SSE_EXP_REAL8 ; 2nd parameter (the flag), pushed first. 4 = the input is a pointer to a Real8, as explained above
push offset MyExpo ; 1st parameter. Pointer to a Real8 or to a Qword, only when we are using 64-bit values. For 32-bit values (dword, Float etc.),
; we pass the value directly, because an int or a Float is not a pointer. (So, without the "offset" thing in masm)
call Sse2_log10_precise
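To make the four input modes concrete, here is a rough Python model of how the first argument is turned into a double before the logarithm starts. It follows the `Test_if` chain in the RosAsm listing; `load_operand` and the flat `memory` buffer are hypothetical stand-ins I made up for illustration, not part of the routine.

```python
import struct

SSE_EXP_INT, SSE_EXP_FLOAT, SSE_EXP_REAL8, SSE_EXP_QWORD = 1, 2, 4, 8

def load_operand(number, flag, memory=b""):
    """Return the double the routine would start from, or None for a bad flag.

    number is the raw 32-bit first argument; memory is a hypothetical flat
    byte buffer standing in for the process address space.
    """
    raw = struct.pack("<I", number & 0xFFFFFFFF)
    if flag & SSE_EXP_INT:       # cvtsi2sd: signed 32-bit integer to double
        return float(struct.unpack("<i", raw)[0])
    if flag & SSE_EXP_FLOAT:     # cvtss2sd: the dword holds Real4 bits
        return struct.unpack("<f", raw)[0]
    if flag & (SSE_EXP_REAL8 | SSE_EXP_QWORD):
        # number is an address; the listing loads the 8 bytes as-is in both cases
        return struct.unpack_from("<d", memory, number)[0]
    return None

print(load_operand(5, SSE_EXP_INT))                 # integer 5
print(load_operand(0x40A00000, SSE_EXP_FLOAT))      # bit pattern of 5.0 as Real4
print(load_operand(0, SSE_EXP_REAL8, struct.pack("<d", 5.0)))
```

The flags are tested bit by bit and in this order, so a value such as 5 (bits 1 and 4 together) would be taken as an integer input.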
RosAsm version is as follows:
[Float_Zero: R$ 0] ; the code below refers to this as Float_Zero
[SSE_Two52: Q$ 4841369599423283200, 4841369599423283200] ; = 2^52 ;D$ 0, 043300000, 0, 043300000]
[SSE_One: R$ 1, R$ 1]
[SSE_LOG_EMASK < 2.22507385850720082e-308 >]; Using Body Equate to set the proper value which is Closer to DBL_MIN that is: 2.2250738585072014e-308
[SSE_LOG_HIMASK < 1.79769313486231571e+308 >]; Using Body Equate to set the proper value which is Closer to DBL_MAX that is: 1.7976931348623158e+308
[SSE_Emask: R$ SSE_LOG_EMASK, SSE_LOG_EMASK] ; D$ 0-1, 0FFFFF, 0-1, 0FFFFF
[<16 SSE_LOG10_CC_0: R$ (111/256), R$ (111/256)] ; L10EA
[<16 SSE_Magic0: R$ 4.39804651110300781e+12, R$ 4.39804651110300781e+12]
[SSE_LOG10_HIMASK < 07FFFFFFF__80000000 >]; Using Body Equate to set the proper value which is Closer to DBL_MAX that is: 1.7976931348623158e+308
[SSE_HiMask0: Q$ SSE_LOG10_HIMASK, SSE_LOG10_HIMASK]
[<16 SSE_Place_Log2: D$ 0, 0
D$ 0FFFFFFFF, 0FFFFFFFF,
D$ 0FFFFFFFF, 0FFFFFFFF,
D$ 0, 0]
[<16 SSE_Log1020: R$ 3.01029995663952832e-1, R$ 2.83633945510449641e-14] ; Log10(2) = 0.3010299956639811952137388947244930267681898814621085413104274611
[<16 Log_Coeff0: R$ 21.5354732628465832, R$ -3.07179525615370474]
[<16 SSE_Log10Var1: R$ -10.8935578527763628, R$ 1.77588163534834509]
[<16 SSE_Log10Var2: R$ 5.66760060334353621, R$ -1.15501676674018694]
[<16 SSE_Log10Var3: R$ 1.61610240749971053e-3, R$ 0]
; most likely a Log10(x) table from 0 to 2 followed by some error
[<16 Log10_Table_T: R$ 0, R$ 0, R$ 6.83942453019881210e-003, R$ 1.06653844666057790e-013
R$ 1.33507081443440260e-002, R$ 8.67505934719954320e-014, R$ 1.99611018521181900e-002, R$ 9.23215523723840900e-014
R$ 2.62229227369061850e-002, R$ 7.49930509874774870e-014, R$ 3.25763513509400580e-002, R$ 2.41276323350416150e-014
R$ 3.90241079016959700e-002, R$ 1.07524865705891230e-014, R$ 4.50982556138797010e-002, R$ 2.01960064510954180e-014
R$ 5.12585643186866950e-002, R$ 3.16576133915421310e-014, R$ 5.70236199724831750e-002, R$ 2.44070710374291600e-014
R$ 6.31113911136935710e-002, R$ 2.48257788963393190e-014, R$ 6.87885240052992230e-002, R$ 1.09693456491489220e-013
R$ 7.45408528944153660e-002, R$ 8.48557345140544200e-014, R$ 8.03703965551676450e-002, R$ 5.64317690419344980e-014
R$ 8.60206705779091860e-002, R$ 2.11074530681352090e-014, R$ 9.14835662794075690e-002, R$ 2.48825492648481080e-014
R$ 9.70160548793046470e-002, R$ 8.88302993483350900e-014, R$ 1.02351435027458140e-001, R$ 8.15103304243699220e-014
R$ 1.07753177325776050e-001, R$ 4.45086172353454940e-014, R$ 1.12947822295495830e-001, R$ 3.08831178628687270e-015
R$ 1.18205353949292660e-001, R$ 3.88929326942825740e-014, R$ 1.23245578588807800e-001, R$ 4.71750903210203580e-014
R$ 1.28344985300145710e-001, R$ 6.57467750563177680e-014, R$ 1.33216699989134210e-001, R$ 2.71250821687622270e-014
R$ 1.38435254551609430e-001, R$ 7.56877432679591610e-015, R$ 1.43127205461155430e-001, R$ 6.81055570899612790e-015
R$ 1.48168577326714510e-001, R$ 6.02542674600232690e-014, R$ 1.52967460208515150e-001, R$ 2.83427723708756480e-014
R$ 1.57515087959154700e-001, R$ 1.09440764415528030e-013, R$ 1.62418959194383210e-001, R$ 5.35145729088875760e-014
R$ 1.67067178541742580e-001, R$ 5.99509971730059460e-014, R$ 1.71450865902443180e-001, R$ 1.13452388582535560e-013
R$ 1.76197300927015020e-001, R$ 3.28390775041250930e-015, R$ 1.80674603281659070e-001, R$ 1.03481913998834240e-013
R$ 1.85198545041771470e-001, R$ 3.73116362151814480e-014, R$ 1.89441967200082220e-001, R$ 2.98008559425539880e-014
R$ 1.93727260613627550e-001, R$ 8.13197508503595880e-014, R$ 1.98055259839406970e-001, R$ 3.57492486261199750e-014
R$ 2.02426824636404490e-001, R$ 7.53155364447138070e-014, R$ 2.06501548650066980e-001, R$ 7.07722033818675070e-014
R$ 2.10959407186123830e-001, R$ 1.06420585148621610e-013, R$ 2.15115366957320480e-001, R$ 6.74895867627637050e-014
R$ 2.18960252674605730e-001, R$ 6.67672190370376170e-014, R$ 2.23193863603228240e-001, R$ 1.36388742479585940e-014
R$ 2.27111265564531100e-001, R$ 2.32760258694916680e-014, R$ 2.31425484637043160e-001, R$ 2.92761620518569070e-014
R$ 2.35053696899512940e-001, R$ 6.25806885708624120e-014, R$ 2.39080054690248290e-001, R$ 3.00590055364650880e-014
R$ 2.43144090557620980e-001, R$ 1.05192270531529520e-014, R$ 2.46871963076841890e-001, R$ 3.27758102924169520e-014
R$ 2.50632111950153560e-001, R$ 2.79059969846129800e-014, R$ 2.54425100967296200e-001, R$ 2.43505807623931280e-014
R$ 2.58251508820308120e-001, R$ 6.53068360680055810e-014, R$ 2.62111929633533690e-001, R$ 7.78445532603097920e-014
R$ 2.65615893362905810e-001, R$ 1.97240087303329050e-014, R$ 2.69542633331980140e-001, R$ 6.12305485363238830e-014
R$ 2.73107313935042840e-001, R$ 3.18805618188936470e-014, R$ 2.76701495678366880e-001, R$ 1.05904780751833190e-013
R$ 2.80325670940214880e-001, R$ 4.14680805130042090e-014, R$ 2.83572747613220600e-001, R$ 1.90891161216058830e-014
R$ 2.87254964996350280e-001, R$ 1.65981890215736650e-014, R$ 2.90554464110186930e-001, R$ 4.83600159236775040e-014
R$ 2.94296613004917160e-001, R$ 9.56300328864571600e-014, R$ 2.97650255012513300e-001, R$ 8.72995849029738040e-014
R$ 3.01029995663952830e-001, R$ 2.83633945510449640e-014
Log10_Table_B: R$ (227328/524288), R$ (227328/524288), R$ (223776/524288), R$ (223776/524288)
R$ (220446/524288), R$ (220446/524288), R$ (217116/524288), R$ (217116/524288)
R$ (214008/524288), R$ (214008/524288), R$ (210900/524288), R$ (210900/524288)
R$ (207792/524288), R$ (207792/524288), R$ (204906/524288), R$ (204906/524288)
R$ (202020/524288), R$ (202020/524288), R$ (199356/524288), R$ (199356/524288)
R$ (196581/524288), R$ (196581/524288), R$ (194028/524288), R$ (194028/524288)
R$ (191475/524288), R$ (191475/524288), R$ (188922/524288), R$ (188922/524288)
R$ (186480/524288), R$ (186480/524288), R$ (184149/524288), R$ (184149/524288)
R$ (181818/524288), R$ (181818/524288), R$ (179598/524288), R$ (179598/524288)
R$ (177378/524288), R$ (177378/524288), R$ (175269/524288), R$ (175269/524288)
R$ (173160/524288), R$ (173160/524288), R$ (171162/524288), R$ (171162/524288)
R$ (169164/524288), R$ (169164/524288), R$ (167277/524288), R$ (167277/524288)
R$ (165279/524288), R$ (165279/524288), R$ (163503/524288), R$ (163503/524288)
R$ (161616/524288), R$ (161616/524288), R$ (159840/524288), R$ (159840/524288)
R$ (158175/524288), R$ (158175/524288), R$ (156399/524288), R$ (156399/524288)
R$ (154734/524288), R$ (154734/524288), R$ (153180/524288), R$ (153180/524288)
R$ (151515/524288), R$ (151515/524288), R$ (149961/524288), R$ (149961/524288)
R$ (148407/524288), R$ (148407/524288), R$ (146964/524288), R$ (146964/524288)
R$ (145521/524288), R$ (145521/524288), R$ (144078/524288), R$ (144078/524288)
R$ (142635/524288), R$ (142635/524288), R$ (141303/524288), R$ (141303/524288)
R$ (139860/524288), R$ (139860/524288), R$ (138528/524288), R$ (138528/524288)
R$ (137307/524288), R$ (137307/524288), R$ (135975/524288), R$ (135975/524288)
R$ (134754/524288), R$ (134754/524288), R$ (133422/524288), R$ (133422/524288)
R$ (132312/524288), R$ (132312/524288), R$ (131091/524288), R$ (131091/524288)
R$ (129870/524288), R$ (129870/524288), R$ (128760/524288), R$ (128760/524288)
R$ (127650/524288), R$ (127650/524288), R$ (126540/524288), R$ (126540/524288)
R$ (125430/524288), R$ (125430/524288), R$ (124320/524288), R$ (124320/524288)
R$ (123321/524288), R$ (123321/524288), R$ (122211/524288), R$ (122211/524288)
R$ (121212/524288), R$ (121212/524288), R$ (120213/524288), R$ (120213/524288)
R$ (119214/524288), R$ (119214/524288), R$ (118326/524288), R$ (118326/524288)
R$ (117327/524288), R$ (117327/524288), R$ (116439/524288), R$ (116439/524288)
R$ (115440/524288), R$ (115440/524288), R$ (114552/524288), R$ (114552/524288)
R$ (113664/524288), R$ (113664/524288)]
; Parameters flag
[SSE_EXP_INT 1
SSE_EXP_FLOAT 2
SSE_EXP_REAL8 4
SSE_EXP_QWORD 8]
; Values to return
[SSE_EXP_INVALID_PARAMETER 0-1] ; Invalid flag
[SSE_UNDERFLOW 0-2] ; The input number underflows
[SSE_OVERFLOWN 0-3] ; The input number overflows
[SSE_INFINITE 0-4] ; General error. The input number is infinite, NAN, etc.
[SSE_ZERO 0-5] ; The input number is zero. Log and Ln are not defined for it
[SSE_NEG_INFINITE 0-6] ; Negative Infinite found
[SSE_POS_INFINITE 0-7] ; Positive Infinite found
[SSE_NAN 0-9] ; NAN. Not a number
Proc Sse2_log10_precise:
Arguments @Number, @Flag
Uses edx, ecx
mov eax D@Flag
Test_if eax SSE_EXP_INT
cvtsi2sd xmm0 D@Number ; converts a signed integer to double
Test_Else_if eax SSE_EXP_FLOAT
cvtss2sd xmm0 X@Number ; converts a single precision float to double
Test_Else_if eax SSE_EXP_REAL8
mov eax D@Number | movsd XMM0 X$eax
Test_Else_if eax SSE_EXP_QWORD
mov eax D@Number | movq XMM0 X$eax
Test_Else
xor eax eax | ExitP ; return 0 Invalid parameter
Test_End
xor edx edx
movupd XMM1 XMM0
unpcklpd XMM0 XMM0
psrlq XMM1 52 | pextrw ecx XMM1 0 | and ecx 0FFF | sub ecx 1
...If ecx > 2045 ; Special cases: zero, negative, denormalized, infinite or NAN
.SSE_D_If xmm0 <= X$Float_Zero ; Input value is zero or negative
mov eax SSE_ZERO
SSE_D_If xmm0 < X$Float_Zero ; Input value is negative
mov eax SSE_NEG_INFINITE
SSE_D_End
ExitP ; Exit the function
.SSE_D_Else
..If ecx = 0-1 ; number is denormalized. We can continue
mulsd XMM0 X$SSE_Two52 ; scale very tiny numbers up by 2^52. Ex: x = 2e-314
mov edx 0-52
xor eax eax
movupd XMM1 XMM0
unpcklpd XMM0 XMM0
psrlq XMM1 52 | pextrw ecx XMM1 0 | and ecx 0FFF | sub ecx 1
..Else
movupd XMM1 X$SSE_One ; same as SSE_One2
andpd XMM0 X$SSE_Emask ; same as SSE_Emask2
orpd XMM0 XMM1
cmpsd XMM1 XMM0 0 ; (EQ) error in rosasm
pextrw eax XMM1 0
If eax = 0 ; Not a Number
mov eax SSE_NAN
Else ; Number is positive infinite
mov eax SSE_POS_INFINITE
End_If
ExitP ; Exit the function
..End_If
.SSE_D_End
...End_If
movupd XMM1 X$SSE_Magic0 | andpd XMM0 X$SSE_Emask | orpd XMM0 X$SSE_One | addpd XMM1 XMM0
pextrw eax XMM1 0 | and eax ((Size_of_LogTable*2)-48)
movupd XMM4 X$Log10_Table_T+eax
movupd XMM1 X$SSE_HiMask0 | andpd XMM1 XMM0
subpd XMM0 XMM1 | mulpd XMM0 X$Log10_Table_B+eax
mulpd XMM1 X$Log10_Table_B+eax | subpd XMM1 X$SSE_LOG10_CC_0
addsd XMM4 XMM1
sub ecx 1022 | add ecx edx
cvtsi2sd XMM2 ecx | shl ecx 10 | add eax ecx | mov ecx 16 | mov edx 0 | cmp eax 0 | cmovz edx ecx
movupd XMM3 XMM0 | andpd XMM3 X$SSE_Place_Log2+edx
addpd XMM0 XMM1
; same as SSE_Place_Log0
unpcklpd XMM2 XMM2 | mulpd XMM2 X$SSE_Log1020 | addpd XMM2 XMM3 | addpd XMM4 XMM2
movupd XMM1 X$Log_Coeff0
movupd XMM2 XMM0 | mulpd XMM2 XMM2 | mulsd XMM2 XMM2 | mulsd XMM2 XMM0
mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1
mulpd XMM1 XMM0
addpd XMM1 X$SSE_Log10Var2 | mulpd XMM1 XMM2
movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0
movupd XMM3 XMM4
unpckhpd XMM3 XMM3
movupd XMM0 XMM1
addpd XMM1 XMM2
unpckhpd XMM0 XMM0
addsd XMM0 XMM1
addsd XMM0 XMM3
addsd XMM0 XMM4
EndP
My timings
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
2747 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11641 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27174 cycles for 100 * pow (CRT, 2.7182818^5)
2619 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11399 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27011 cycles for 100 * pow (CRT, 2.7182818^5)
2632 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11207 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25996 cycles for 100 * pow (CRT, 2.7182818^5)
2593 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11419 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26219 cycles for 100 * pow (CRT, 2.7182818^5)
2553 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11310 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26940 cycles for 100 * pow (CRT, 2.7182818^5)
148.413159102577 for Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)
Updated to a new version with the values between parentheses properly calculated. File: TimmingsLog10g.zip (faster and still accurate)
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
3398 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12420 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24740 cycles for 100 * pow (CRT, 2.7182818^5)
3412 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12403 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24733 cycles for 100 * pow (CRT, 2.7182818^5)
3402 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12412 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24766 cycles for 100 * pow (CRT, 2.7182818^5)
3401 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12403 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24788 cycles for 100 * pow (CRT, 2.7182818^5)
3411 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12437 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24734 cycles for 100 * pow (CRT, 2.7182818^5)
148.413159102577 for Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)
Hi Marinus
You were right. I'm liking this SSE stuff :bgrin: :bgrin: :bgrin: I must confess, it's a bit complicated at first, but the results are worth the effort :thumbsup: :thumbsup:
I'm going to try building a DLL from these faster functions, for use with the image processing functions. This could be very useful.
:thumbsup:
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
5396 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11737 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
5388 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11746 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
5387 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11779 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
5447 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11731 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
5388 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11745 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for SmplMath: fSlv MyReal8 = log(MyExpo)
:bgrin:
Does it return the result in xmm0?
call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0
Apparently this is faster:
movdqu oword ptr MyReal8, xmm0
LATER: :biggrin:
Quote from: jj2007 on August 16, 2020, 08:00:03 AM
movlps MyReal8, xmm0
perhaps it should be:
movlpd MyReal8, xmm0
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
5210 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11760 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
5217 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11750 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
5213 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11758 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
5228 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11796 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
5215 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11820 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for SmplMath: fSlv MyReal8 = log(MyExpo)
Quote from: jj2007 on August 16, 2020, 08:00:03 AM
Does it return the result in xmm0?
call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0
Hi JJ
Yes, the result is in xmm0
If i ported it correctly, it should return this:
(https://i.ibb.co/F3bBsQV/sfd-Image1.png) (https://ibb.co/C85nJj2)
Quote from: guga on August 16, 2020, 08:30:25 AM
Quote from: jj2007 on August 16, 2020, 08:00:03 AM
Does it return the result in xmm0?
call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0
Hi JJ
Yes, the result is in xmm0
If i ported it correctly, it should return this:
Just add the
movlps MyReal8, xmm0, and you will see the result at the end.
movlpd
Quote from: HSE on August 16, 2020, 08:58:08 AM
movlpd
Most if not all sse* mov instructions don't care what format they are moving. I use movlps because it's one byte shorter.
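The "one byte shorter" part can be checked from the opcode maps: the store form of movlps is `0F 13 /r`, movlpd is the same opcode with a `66` prefix, and the movdqu store is `F3 0F 7F /r`. A small Python sketch that just spells out the bytes for a store of xmm0 to an absolute 32-bit address (the address itself is made up):

```python
# Absolute-address forms in 32-bit code: ModRM 05h = [disp32], reg field = xmm0.
ADDR = (0x00403000).to_bytes(4, "little")   # hypothetical address of MyReal8

encodings = {
    "movlps": bytes([0x0F, 0x13, 0x05]) + ADDR,        # 0F 13 /r
    "movlpd": bytes([0x66, 0x0F, 0x13, 0x05]) + ADDR,  # 66 0F 13 /r
    "movdqu": bytes([0xF3, 0x0F, 0x7F, 0x05]) + ADDR,  # F3 0F 7F /r
}

for name, code in encodings.items():
    print(f"{name}: {len(code)} bytes  {code.hex(' ')}")
```

Keep in mind that movdqu writes 16 bytes, so it only makes sense when the destination has room for a full oword.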
Quote from: jj2007 on August 16, 2020, 08:38:17 AM
Quote from: guga on August 16, 2020, 08:30:25 AM
Quote from: jj2007 on August 16, 2020, 08:00:03 AM
Does it return the result in xmm0?
call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0
Hi JJ
Yes, the result is in xmm0
If i ported it correctly, it should return this:
Just add the movlps MyReal8, xmm0, and you will see the result at the end.
Hi JJ. The value is shown incorrectly. I guess I didn't port it properly.
Most likely the problem is in the data variables.
Log10_Table_T and Log10_Table_B are part of the same table. So basically it is one array split in two, but when I ported it to masm, the values of Log10_Table_B were gone :dazzled: :dazzled: :dazzled:
I ported it as:
Log10_Table_B dq (227328/524288), (227328/524288), (223776/524288), (223776/524288)
dq (220446/524288), (220446/524288), (217116/524288), (217116/524288)
dq (214008/524288), (214008/524288), (210900/524288), (210900/524288)
dq (207792/524288), (207792/524288), (204906/524288), (204906/524288)
dq (202020/524288), (202020/524288), (199356/524288), (199356/524288)
(...)
All the values between parentheses were zeroed. Masm assembled it as:
Log10_Table_B xmmword 41h dup(0)
Why?
How can I make masm calculate the values of the qwords, such as 220446/524288? Do I need to remove the parentheses?
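As far as I know, MASM evaluates a constant expression like this with integer arithmetic, and every numerator in the table is smaller than 524288, so each quotient truncates to 0. The parentheses are not the cause. A tiny Python illustration of integer versus real division on the first few entries:

```python
numerators = [227328, 223776, 220446, 217116]   # first entries of Log10_Table_B
DENOM = 524288                                  # 2**19

for n in numerators:
    # n // DENOM is what an integer-only expression evaluator produces
    print(n, "integer:", n // DENOM, " real:", n / DENOM)
```

The first entry, for instance, is 0.43359375 as a real number, which is the same value as the 111/256 constant used elsewhere in the listing.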
Quote from: jj2007 on August 16, 2020, 09:46:48 AM
Most if not all sse* mov instructions don't care what format they are moving. I use movlps because it's one byte shorter.
Well JJ, you are lucky: your machine is smarter than mine!
movlps :
9744 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
movlpd :
5192 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
movdqu oword ptr MyReal8, xmm0 :
5384 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
That looks odd, can you post the executables? The call to the routine is a hundred times slower than the movlps/movlpd :cool:
Let's test it:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
25285 cycles for 10000 * movlps
25166 cycles for 10000 * movlpd
25808 cycles for 10000 * movdqu
25155 cycles for 10000 * movlps
25053 cycles for 10000 * movlpd
25257 cycles for 10000 * movdqu
25271 cycles for 10000 * movlps
24811 cycles for 10000 * movlpd
25296 cycles for 10000 * movdqu
25034 cycles for 10000 * movlps
25595 cycles for 10000 * movlpd
25706 cycles for 10000 * movdqu
24902 cycles for 10000 * movlps
25345 cycles for 10000 * movlpd
25546 cycles for 10000 * movdqu
28 bytes for movlps
32 bytes for movlpd
32 bytes for movdqu
R8 1234567890.123456716
R8 1234567890.123456716
R8 1234567890.123456716
It's the routine together with the mov.
The code is above, just change line 345.
Quote from: HSE on August 16, 2020, 10:31:42 AM
It's the routine together with the mov.
The code is above, just change line 345.
Please show the line in context, I can't find it.
Guys, the syntax of Table_B is wrong. How do I make masm calculate the values in parentheses in the data section?
Log10_Table_B dq (227328/524288), (227328/524288), <----- this makes masm assemble it as db 000000000 rather than the value of each division
OK, guys... Now it is working as expected. The result is OK.
All I did was replace the expressions between ( ) with their calculated values, and it returned the correct answer.
Update attached. The source is the same as before, except for the fix of the values in parentheses. I updated the 1st post too.
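If somebody wants to regenerate the precomputed table instead of typing the values by hand, something like the following Python sketch could print the data lines. The `REAL8` layout and the helper name `real8_lines` are my own choice for illustration; adjust the directive to whatever your assembler expects.

```python
def real8_lines(numerators, denom=524288, label="Log10_Table_B"):
    """Build data lines with each quotient precomputed and stored twice
    (once per 64-bit lane), two table entries per line."""
    lines = []
    for i in range(0, len(numerators), 2):
        pair = numerators[i:i + 2]
        values = ", ".join(repr(n / denom) for n in pair for _ in range(2))
        prefix = label if i == 0 else " " * len(label)
        lines.append(f"{prefix} REAL8 {values}")
    return lines

for line in real8_lines([227328, 223776, 220446, 217116]):
    print(line)
```

Because the denominator is a power of two, every quotient is exactly representable as a double, so nothing is lost by precomputing.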
Result is:
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
2786 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
11317 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26628 cycles for 100 * pow (CRT, 2.7182818^5)
2662 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
11268 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27312 cycles for 100 * pow (CRT, 2.7182818^5)
2686 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
11511 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26168 cycles for 100 * pow (CRT, 2.7182818^5)
2832 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
12049 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27194 cycles for 100 * pow (CRT, 2.7182818^5)
2781 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
11254 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26377 cycles for 100 * pow (CRT, 2.7182818^5)
0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)
--- ok ---
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
26512 cycles for 10000 * movlps
26975 cycles for 10000 * movlpd
62428 cycles for 10000 * movdqu
26448 cycles for 10000 * movlps
26523 cycles for 10000 * movlpd
62479 cycles for 10000 * movdqu
26537 cycles for 10000 * movlps
26812 cycles for 10000 * movlpd
62242 cycles for 10000 * movdqu
26604 cycles for 10000 * movlps
26673 cycles for 10000 * movlpd
62401 cycles for 10000 * movdqu
26578 cycles for 10000 * movlps
26707 cycles for 10000 * movlpd
62530 cycles for 10000 * movdqu
28 bytes for movlps
32 bytes for movlpd
32 bytes for movdqu
R8 1234567890.123456716
R8 1234567890.123456716
R8 1234567890.123456716
-
Quote from: jj2007 on August 16, 2020, 10:45:44 AM
Please show the line in context, I can't find it.
:biggrin: Do you have problems with the TestBed?
(clarification: the TestBed program was written by jj2007)
Quote from: TimoVJL on August 16, 2020, 05:18:16 PM
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
Only SSE3 here! Perhaps that's it.
Quote from: guga on August 16, 2020, 10:57:53 AM
Guys, the syntax of Table_B is wrong. How do I make masm calculate the values in parentheses in the data section?
Log10_Table_B dq (227328/524288), (227328/524288), <----- this makes masm assemble it as db 000000000 rather than the value of each division
You can use qWord's floating point arithmetic while assembling: MREAL-macros (http://masm32.com/board/index.php?topic=3225.msg33774#msg33774)
From what I gather, you use tables to calculate the logarithm. Just for fun, using Maple I computed a rational approximation to ln(1+x); multiplying it by log10(e) gives an approximation to log10(1+x), with x between 0 and 1.
ln(1+x) = (1+(2.45432048794495419+(2.17440739254242255+(.841813647259988564+(.135819954481562186+(0.687077985071464530e-2+0.231261455529019880e-4*x)*x)*x)*x)*x)*x)*x/(1+(2.95432048794495173+(3.31823430318176310+(1.76615730286299464+(.451400626940937546+(0.492131362215352252e-1+0.158489557252300951e-2*x)*x)*x)*x)*x)*x)
log10(1+x) = (.434294481903251828+(1.06589784473659010+(.944333131990812123+(.365595021795863521+(0.589858567636932956e-1+(0.298394177553741882e-2+0.100435574013167601e-4*x)*x)*x)*x)*x)*x)*x/(1+(2.95432048794495173+(3.31823430318176310+(1.76615730286299464+(.451400626940937546+(0.492131362215352252e-1+0.158489557252300951e-2*x)*x)*x)*x)*x)*x)
Quote from: jack on August 17, 2020, 03:18:10 AM
From what I gather, you use tables to calculate the logarithm. Just for fun, using Maple I computed a rational approximation to ln(1+x); multiplying it by log10(e) gives an approximation to log10(1+x), with x between 0 and 1.
ln(1+x) = (1+(2.45432048794495419+(2.17440739254242255+(.841813647259988564+(.135819954481562186+(0.687077985071464530e-2+0.231261455529019880e-4*x)*x)*x)*x)*x)*x)*x/(1+(2.95432048794495173+(3.31823430318176310+(1.76615730286299464+(.451400626940937546+(0.492131362215352252e-1+0.158489557252300951e-2*x)*x)*x)*x)*x)*x)
log10(1+x) = (.434294481903251828+(1.06589784473659010+(.944333131990812123+(.365595021795863521+(0.589858567636932956e-1+(0.298394177553741882e-2+0.100435574013167601e-4*x)*x)*x)*x)*x)*x)*x/(1+(2.95432048794495173+(3.31823430318176310+(1.76615730286299464+(.451400626940937546+(0.492131362215352252e-1+0.158489557252300951e-2*x)*x)*x)*x)*x)*x)
Great, jack. Thanks a lot.
Some questions:
1 - What is Maple? (Where can I download it?)
2 - Did you test the accuracy? What is the precision of the values you posted, and can it handle denormalized values too?
3 - Can Maple produce these polynomial values without a division? Division is slow.
Hi Guga,
You can always avoid divisions when using constants.
Make a reciprocal of the constant and you can multiply instead of divide.
Value real4 256.0
recValue real4 0.00390625 ; 1/256
divss xmm0,real4 ptr Value ; either divide by the constant...
mulss xmm0,real4 ptr recValue ; ...or multiply by its reciprocal
Both give the same result.
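One caveat worth adding: the two forms are bit-for-bit identical here because 256 is a power of two, so its reciprocal is exact. For other constants the reciprocal is rounded and the product can differ in the last bit. A small Python check (using doubles rather than Real4, but the principle is the same; `same_result` is just a helper name I picked):

```python
def same_result(x, divisor):
    """True when dividing by `divisor` equals multiplying by its rounded reciprocal."""
    return x / divisor == x * (1.0 / divisor)

samples = [1.0, 3.0, 0.1, 5.0, 123.456, 1e-3, 7.0]

print("divisor 256:", all(same_result(x, 256.0) for x in samples))
print("divisor 10 :", [x for x in samples if not same_result(x, 10.0)])
```

For speed-oriented code a last-bit difference is usually acceptable, but it matters if you compare against a reference bit for bit.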
1- Maple is a computer algebra system: https://www.maplesoft.com/ns/maple/cas/computer-algebra-systems-math-education.aspx
2- The input range for x is between 0 and 1. The error is probably +/-1e-16; it depends on the precision used in the evaluation. As for denormalized values: this is just a rational polynomial approximation, unrelated to floating-point intrinsics.
3- Yes, but a rational polynomial usually requires fewer terms than a plain polynomial to reach a given precision; it would take a polynomial of degree 19 or 20 to get the same precision.
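For anyone who wants to check the accuracy claim on the stated range, here is a minimal Python sketch that evaluates the log10(1+x) formula from above with Horner's scheme and compares it with `math.log10` on a grid over [0, 1]. The coefficients are copied from the post; the function names and the grid size are my own choices, and double-precision evaluation adds its own rounding on top of the approximation error.

```python
import math

# Coefficients copied from the log10(1+x) formula above, lowest power first.
NUM = [0.434294481903251828, 1.06589784473659010, 0.944333131990812123,
       0.365595021795863521, 0.589858567636932956e-1,
       0.298394177553741882e-2, 0.100435574013167601e-4]
DEN = [2.95432048794495173, 3.31823430318176310, 1.76615730286299464,
       0.451400626940937546, 0.492131362215352252e-1, 0.158489557252300951e-2]

def horner(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... with Horner's scheme."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def log10_1p(x):
    """Rational approximation to log10(1+x), meant for 0 <= x <= 1."""
    return horner(NUM, x) * x / (1.0 + horner(DEN, x) * x)

def max_abs_error(lo, hi, steps=1000):
    """Largest |approximation - math.log10(1+x)| on an evenly spaced grid."""
    worst = 0.0
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        worst = max(worst, abs(log10_1p(x) - math.log10(1.0 + x)))
    return worst

print("max error on [0, 1]:", max_abs_error(0.0, 1.0))
```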
Quote from: Siekmanski on August 17, 2020, 05:45:59 AM
Make a reciprocal of the constant and you can multiply instead of divide.
Multiplying is more efficient than dividing :thumbsup:
Integer division using reciprocals -- Robert Alverson
https://www.computer.org/csdl/proceedings-article/arith/1991/00145558/12OmNyaXPS1
Quote from: HSE on August 16, 2020, 07:47:28 AM
:thumbsup:
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
5396 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11737 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3163 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
5060 cycles for 100 * Log10
3211 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4933 cycles for 100 * Log10
3209 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4823 cycles for 100 * Log10
3186 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4773 cycles for 100 * Log10
3219 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4739 cycles for 100 * Log10
536 bytes for Sse2_log10_precise (Guga SSE2 Log10 precise )
16 bytes for Log10
0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for Log10
0.6989700043360188 expected
With last Guga correction:
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
5257 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11751 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
11687 cycles for 100 * JJ Log10
5251 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11741 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
11637 cycles for 100 * JJ Log10
5254 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11758 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
11664 cycles for 100 * JJ Log10
5248 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11760 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
11641 cycles for 100 * JJ Log10
5274 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11735 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
11653 cycles for 100 * JJ Log10
0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for SmplMath: fSlv MyReal8 = log(MyExpo)
0.698970004336019 for JJ Log10
MB Log10 under the hood:
fldlg2
fld MyExpo
fyl2x
fstp MyReal8
Hi Marinus
With the equation provided by Jack we can't use a reciprocal, unfortunately. The divisor is another polynomial.
Jack, I gave it a try using Log10(5+1) with the formula, and the precision is only around 5 digits after the ".". If you input 5 as x, log10(x+1) = log10(6) evaluates to:
(0.434294481903251828+(1.0658978447365901+(0.944333131990812123+(0.365595021795863521+(0.589858567636932956e-1+(0.298394177553741882e-2+0.100435574013167601e-4*x)*x)*x)*x)*x)*x)*x, x=5
= 607.096994200488817
(1+(2.95432048794495173+(3.31823430318176310+(1.76615730286299464+(0.451400626940937546+(0.492131362215352252e-1+0.158489557252300951e-2*x)*x)*x)*x)*x)*x) , x=5
= 780.177558728198735
607.096994200488817/780.177558728198735 = 0.778152341615854664455814528055710104487791345037987518732
Expected log10(6) = 0.7781512503836436325087667979796083359683187456528044061402931014.
0.778152341615854664455814528055710104487791345037987518732 ; result using maple
0.7781512503836436325087667979796083359683187456528044061402931014 ; expected result
Can you please try to see if the precision can be extended to at least 14 digits after the "." while keeping the number of polynomial terms in "x" the same (or, better, simplifying it)? I mean, it is an equation where the numerator is a polynomial of the form x^7+x^6+... and the divisor one of the form x^6+x^5... It can be reformulated as:
(https://i.ibb.co/0mTnQ11/gfds-Image2.png) (https://imgbb.com/)
where A, B, C...are the values you posted and "x" the inputed value to calculate. We could try to put the numerator on a matrix and the divisor on other matrix and try to divide the matrix using the inversal of the divisor (The equation with x^6+...), but calculating the inverse matrix and also needing to check later if it can be divided will take a lot of time to process too.
Please, see if the numbers you created with Maple can be extended to at least 14 digits precision (after the "."), and also try simplifying the equation so we can try to see if it´s faster then the one i made.
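To make the check above reproducible without Maple, here is a small Python sketch that evaluates the posted rational approximation of log10(1+x) with Horner's scheme. The coefficients are copied from the post; the function and list names are made up for illustration.

```python
import math

# Coefficients as posted: numerator is x * (c0 + c1*x + ... + c6*x^6),
# denominator is 1 + d1*x + ... + d6*x^6.
NUM = [0.434294481903251828, 1.0658978447365901, 0.944333131990812123,
       0.365595021795863521, 0.589858567636932956e-1,
       0.298394177553741882e-2, 0.100435574013167601e-4]
DEN = [1.0, 2.95432048794495173, 3.31823430318176310, 1.76615730286299464,
       0.451400626940937546, 0.492131362215352252e-1, 0.158489557252300951e-2]

def horner(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + ... using nested multiplication."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def log10_1p_rational(x):
    """Rational approximation of log10(1 + x) from the post."""
    return x * horner(NUM, x) / horner(DEN, x)

approx = log10_1p_rational(5.0)
error = abs(approx - math.log10(6.0))
print(approx, error)
```

The error at x=5 comes out at roughly one unit in the sixth decimal, which matches the "around 5 digits" observation.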
Quote from: jj2007 on August 17, 2020, 10:20:09 AM
Quote from: HSE on August 16, 2020, 07:47:28 AM
:thumbsup:
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
5396 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11737 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3163 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
5060 cycles for 100 * Log10
3211 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4933 cycles for 100 * Log10
3209 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4823 cycles for 100 * Log10
3186 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4773 cycles for 100 * Log10
3219 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4739 cycles for 100 * Log10
536 bytes for Sse2_log10_precise (Guga SSE2 Log10 precise )
16 bytes for Log10
0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for Log10
0.6989700043360188 expected
Hi JJ.
I think you used the older version. The new one produces the correct value:
0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
The error is in the last 2 digits only :) Your version prints fewer digits than mine. In RosAsm, the full number (18 digits) is 0.698970004336018857, with only the last 2 digits losing precision.
Please try with the FastLog10a.zip I posted a few comments earlier :)
I managed to optimize a bit further, but I'm trying to work out the proper math to optimize it even more and avoid using so many registers. I now need to optimize this part (the modification below also produces the correct result and gained extra speed):
(...)
; same as SSE_Place_Log0
unpcklpd XMM2 XMM2 | mulpd XMM2 X$SSE_Log1020 | addpd XMM2 XMM3 | addpd XMM4 XMM2
movupd XMM1 X$Log_Coeff0
movupd XMM2 XMM0 | mulpd XMM2 XMM2 | mulsd XMM2 XMM2 | mulsd XMM2 XMM0
mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1
mulpd XMM1 XMM0
addpd XMM1 X$SSE_Log10Var2 | mulpd XMM1 XMM2
movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0
;--------------------
; all necessary data is stored in both packed doubles of registers xmm4, xmm1 and xmm2. Modified original version. With the shuffle it is faster
; We only need to sum all of them
addpd xmm1 xmm4 ; add both doubles of xmm4 to xmm1. xmm1 = xmm1+xmm4
addpd xmm1 xmm2 ; add both doubles of xmm2 to the result above. xmm1 = xmm1+xmm4+xmm2
movupd xmm0 xmm1 ; Now we need to sum both doubles of the resulting register xmm1. To do that we copy xmm1 to xmm0
pshufd XMM0 XMM0 SSE_SWAP_QWORDS ; and swap its pair of doubles. This way xmm0 and xmm1 hold the pair in opposite order. SSE_SWAP_QWORDS = 78
addpd xmm0 xmm1 ; and now we only need to add them. Both doubles of xmm0 = low + high of (xmm1+xmm4+xmm2)
The math to do this is the formula below ("Lo" and "Hi" are the low and Hi parts of the Double quadword on each register or variable):
Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
Final xmm0 = Hiquadxmm0 + Loquadxmm0
I'm trying to recreate the above equations using fewer registers (forcing it to use only xmm0, xmm1 and xmm2) while keeping the speed, with shorter computations, but it is hard to find the proper way.
I tried the code below, but the result is incorrect. I'm missing something.
;---------------------------- This is good for speed, but I'm missing some calculation, because it now produces an incorrect value.
; 1st step xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0 | addpd xmm2 xmm4
; ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
movupd XMM1 X$Log_Coeff0 | mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var2
; since xmm0.Hi and xmm0.Lo are part of the same xmm0 we can simply multiply this to get
; the Lo and Hi quads of xmm0. For the Lo quad we must then simply multiply by the Lo quad 3 more times to get xmm0.Hi^5
mulpd XMM1 XMM0 | mulpd XMM1 XMM0
; ok, now we have the Hi part xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
; To get the Lo part we do:
mulsd xmm1 xmm0 | mulsd xmm1 xmm0 | mulsd xmm1 xmm0
; Now we can finally add xmm2 to xmm1
addpd xmm1 xmm2
; And finally we exchange the data to add both hi and lo quads
movupd xmm0 xmm1 ; Now we need to sum both doubles of the resulting register xmm1. To do that we copy xmm1 to xmm0
pshufd XMM0 XMM0 SSE_SWAP_QWORDS ; and swap its pair of doubles. This way xmm0 and xmm1 hold the pair in opposite order
addpd xmm0 xmm1 ; and now we only need to add them. xmm0 = low + high of xmm1
Hi Guga,
You can create a reciprocal with these SSE instructions:
rcpps - Approximates the reciprocal of 4 packed floats.
rcpss - Approximates the reciprocal of a single float.
Hi marinus.
Great :thumbsup: :thumbsup: :thumbsup: :thumbsup:
But how can we apply it to the previous code to minimize the usage of so many registers and make it shorter and faster? (My version does not use divisions)
I'm trying to simplify this even further:
Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
Final xmm0 = Hiquadxmm0 + Loquadxmm0
But I haven't succeeded yet
Just noticed you use REAL8.
Unfortunately there are no rcppd and rcpsd instructions.
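For what it's worth, a low-precision reciprocal can in principle be refined to double precision with Newton-Raphson steps, r' = r * (2 - d * r), each of which roughly doubles the number of correct bits. The Python sketch below only illustrates that idea: `rough_reciprocal` is a hypothetical stand-in for a roughly 12-bit estimate such as rcpss delivers, not a model of the real instruction, and whether three refinement steps would beat a plain divsd is untested here.

```python
import math

def rough_reciprocal(d: float, bits: int = 12) -> float:
    """Hypothetical stand-in for rcpss: 1/d rounded to about `bits` significant bits."""
    m, e = math.frexp(1.0 / d)              # mantissa m in [0.5, 1)
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)

def refine(d: float, r: float, steps: int) -> float:
    """Newton-Raphson for 1/d: each step roughly doubles the correct bits."""
    for _ in range(steps):
        r = r * (2.0 - d * r)
    return r

d = 780.177558728198735                     # the divisor from the log10(6) example
r0 = rough_reciprocal(d)
r3 = refine(d, r0, 3)
print(abs(r0 * d - 1.0), abs(r3 * d - 1.0))
```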
Ok, I think i got it working as expected.
I had to do this:
; The formula to retrieve xmm0 is as follows:
; Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
; Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
; Final xmm0 = Hiquadxmm0 + Loquadxmm0
; 1st step: we calculate the low and hi values of xmm4.Hi + (Log10Var3.Hi*xmm0.Hi) and xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0 | addpd xmm2 xmm4 ; Save the result in xmm2
; Now, what do both have in common? Both have this in common:
; xmm0.Lo * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
; xmm0.Hi * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
; So we compute this only once and save it in the xmm1 register, so that we get both low and hi values
; ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
movupd XMM1 X$Log_Coeff0 | mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1 | mulpd xmm1 xmm0 | addpd XMM1 X$SSE_Log10Var2
; now we get the LoQuad xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
; and the hi quad xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
; at once. (Because they share the same values of xmm0.Lo^2)
mulpd xmm1 xmm0 | mulpd xmm1 xmm0 ; xmm0.Lo^2 is calculated and applied to both low and hi doubles of xmm1
; let's now do the same for xmm0.Hi^5
; since we already calculated xmm0.Hi^2, we only need to multiply 3 more times by xmm0 to get the value of Loquadxmm0
mulsd xmm1 xmm0 | mulsd xmm1 xmm0 | mulsd xmm1 xmm0
; Now we can finally add xmm2 to xmm1
addpd xmm1 xmm2
; And finally we exchange the data to add both hi and lo quads
movupd xmm0 xmm1 ; Now we need to sum both doubles of the resulting register xmm1. To do that we copy xmm1 to xmm0
pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; and swap its pair of doubles. This way xmm0 and xmm1 hold the pair in opposite order. SSE_SWAP_QWORDS is a simple equate whose value is 78, which makes pshufd perform the swap
addpd xmm0 xmm1 ; and now we only need to add them. xmm0 = low + high of xmm1
Although I removed extra registers in this part of the code, it still needs 17 instructions to calculate the final result in xmm0.
I wonder if this can be optimized even further in SSE2.
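One way to compare the three listings above without an assembler is to emulate the two 64-bit lanes in a few lines of Python. This is only a sketch: a register is modelled as a (low, high) tuple, the helper names are made up, and the coefficient values are arbitrary placeholders, since the real tables are not shown in the thread. Reading the listings side by side, the failed attempt appears to skip the "addpd SSE_Log10Var1 | mulpd xmm0" step, and the emulation reproduces that difference.

```python
from operator import add, mul

def packed(op, a, b):
    """addpd / mulpd: apply op to both lanes of (low, high) registers."""
    return (op(a[0], b[0]), op(a[1], b[1]))

def scalar(op, a, b):
    """addsd / mulsd: apply op to the low lane only, keep the high lane of a."""
    return (op(a[0], b[0]), a[1])

def hsum(r):
    """movupd + pshufd 78 + addpd: low lane + high lane."""
    return packed(add, (r[1], r[0]), r)[0]

def first_version(x, acc, c0, v1, v2, v3):
    pw = packed(mul, x, x)                       # x^2 in both lanes
    pw = scalar(mul, scalar(mul, pw, pw), x)     # low lane becomes x^5
    p = packed(mul, packed(add, packed(mul, c0, x), v1), x)
    p = packed(mul, packed(add, p, v2), pw)
    return hsum(packed(add, packed(add, p, acc), packed(mul, v3, x)))

def reworked_version(x, acc, c0, v1, v2, v3):
    lin = packed(add, packed(mul, v3, x), acc)   # kept in xmm2
    p = packed(mul, packed(add, packed(mul, c0, x), v1), x)
    p = packed(add, p, v2)
    p = packed(mul, packed(mul, p, x), x)        # * x^2 in both lanes
    for _ in range(3):                           # low lane: three more * x
        p = scalar(mul, p, x)
    return hsum(packed(add, p, lin))

def failed_attempt(x, acc, c0, v1, v2, v3):
    lin = packed(add, packed(mul, v3, x), acc)
    p = packed(add, packed(mul, c0, x), v2)      # the Var1 step is missing here
    p = packed(mul, packed(mul, p, x), x)
    for _ in range(3):
        p = scalar(mul, p, x)
    return hsum(packed(add, p, lin))

# Placeholder inputs, chosen only to exercise the arithmetic.
args = ((0.3, 0.3), (0.1, 0.2), (0.5, -0.25), (1.5, 0.75), (-2.0, 0.125), (0.4343, 0.0))
a, b, c = first_version(*args), reworked_version(*args), failed_attempt(*args)
print(a, b, c)
```

The first and the reworked version agree up to rounding, because they only reorder the same multiplications and additions.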
Quote from: guga on August 17, 2020, 05:20:27 PM
Hi JJ.
I think you used the older version. The new one produces the correct value:
0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
The error is on the last 2 digits only :) Your version have less, digits then mine. So, on RosAsm, the full number (18 digits) is: 0.698970004336018857 with only the last 2 digits that looses precision.
Please, try with FastLog10a.zip i posted a few comments earlier :)
The only changes necessary are useC=0 and in TestB, line 357:
SetFloat MyReal8=Log10(MyExpo)
instead of ExpXY(MyBaseB, MyExpo, MyReal8)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3157 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4795 cycles for 100 * MasmBasic Log10
3166 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4830 cycles for 100 * MasmBasic Log10
3145 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4772 cycles for 100 * MasmBasic Log10
3212 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4767 cycles for 100 * MasmBasic Log10
3145 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4765 cycles for 100 * MasmBasic Log10
0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for MasmBasic Log10
We are close :smiley:
As regards precision:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
SetGlobals MyReal10:REAL10
Init
PrintLine "0.698970004336018804786 (expected)"
SetFloat MyReal10=Log10(5.0)
Print Str$(MyReal10)
EndOfCode
0.698970004336018804786 (expected)
0.6989700043360188048
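The gap between the REAL10 output and the REAL8 results discussed earlier comes down to how many digits a 64-bit double can hold. As a rough illustration, the Python sketch below prints the exact decimal expansion of the double returned by `math.log10(5.0)` and its distance from the expected value; Python floats are REAL8, so this says nothing about the REAL10 path, and the last digits may depend on the platform's math library.

```python
import math
from decimal import Decimal

EXPECTED = Decimal("0.698970004336018804786")   # reference value from the thread

as_real8 = math.log10(5.0)          # a 64-bit double, i.e. REAL8
exact_digits = Decimal(as_real8)    # exact decimal expansion of that binary double
error = abs(exact_digits - EXPECTED)

print(exact_digits)
print(error)
```

Any REAL8 near 0.7 sits on a grid with a spacing of about 1.1e-16, so digits beyond the 16th after the "." cannot be expected to match.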
Quote from: jj2007 on August 17, 2020, 09:17:56 PM
We are close :smiley:
Great! Thanks, JJ. I'll change the code and compare it to yours. I'm trying to see if I can optimize a bit further before posting a newer version :)
New version.
A bit faster (something around 10% faster). :thumbsup:
I think I reached my limit of optimization :bgrin: :bgrin:. If someone wants to give it a try and optimize further, please do :greensml: :greensml: :greensml:
Accuracy not affected by the current optimization.
Precision is:
16 digits after the "." for normalized values
12 digits after the "." for denormalized values
My timings:
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
2640 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6699 cycles for 100 * Log10 (MasmBasic, Log10 of 5)
2507 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6765 cycles for 100 * Log10 (MasmBasic, Log10 of 5)
2471 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6774 cycles for 100 * Log10 (MasmBasic, Log10 of 5)
2516 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6785 cycles for 100 * Log10 (MasmBasic, Log10 of 5)
2569 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6600 cycles for 100 * Log10 (MasmBasic, Log10 of 5)
0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for Log10 (MasmBasic, Log10 of 5)
--- ok ---
Updated version also on the 1st post
:thumbsup:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2867 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4731 cycles for 100 * Log10 (MasmBasic, Log10 of 5)
2860 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4780 cycles for 100 * Log10 (MasmBasic, Log10 of 5)
2834 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4775 cycles for 100 * Log10 (MasmBasic, Log10 of 5)
2878 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4760 cycles for 100 * Log10 (MasmBasic, Log10 of 5)
2858 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4742 cycles for 100 * Log10 (MasmBasic, Log10 of 5)
0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for Log10 (MasmBasic, Log10 of 5)
Great !
Small improvement.
Replace the cmp ecx, 1023 with a test of ecx against (not 1023), like this:
test ecx, 0FFFFFC00h
jz loc_4C6BFC
; cmp ecx, 1023; Check for special cases (NAN, INF etc). Otherwise, jmp to the start of computation
; jbe loc_4C6BFC
It should speed things up by another 2 or 3 % :bgrin:
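The replacement works because 1023 is 2^10 - 1: an unsigned value is at most 1023 exactly when none of its upper 22 bits is set. A small Python sketch of that equivalence, treating ecx as an unsigned 32-bit value (the function names are made up for illustration):

```python
MASK = 0xFFFFFC00                      # not 1023, as a 32-bit value

def branch_with_cmp(ecx: int) -> bool:
    """cmp ecx, 1023 / jbe: taken when ecx <= 1023 (unsigned)."""
    return ecx <= 1023

def branch_with_test(ecx: int) -> bool:
    """test ecx, 0FFFFFC00h / jz: taken when no bit above bit 9 is set."""
    return (ecx & MASK) == 0

samples = list(range(0, 4096)) + [2**31 - 1, 2**31, 2**32 - 1024, 2**32 - 1]
mismatches = [v for v in samples if branch_with_cmp(v) != branch_with_test(v)]
print(mismatches)
```

The same trick would not carry over to a limit that is not of the form 2^n - 1.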
By the way, take a look at the source. I commented it a little and also changed the tables Log10_Table_T and SSE_Log1020 to hold the proper values. Now the 2nd Real8 in both tables can be zeroed, and later this could be improved a bit more by using an array of 2 Real8 instead of 4 Real8 values in the tables Log10_Table_T and Log10_Table_B, and perhaps by using mulsd in some parts of the code rather than mulpd. I haven't checked this part yet, but I guess it is fast and precise enough already :)
The Log10_Table_T table is formed by log10(x), where x goes from 1 to 2.
I haven't found out yet how those x values were generated, but the table was calculated as follows:
; Log10_Table_T is formed by log10(x), where x takes the values below:
; dq log10(1), log10(1), log10(1.01587301587301580), log10(1)
; dq log10(1.03121852970795570), log10(1), log10(1.04703476482617600), log10(1)
; dq log10(1.06224066390041490), log10(1), log10(1.07789473684210520), log10(1)
; dq log10(1.09401709401709410), log10(1), log10(1.10942578548212340), log10(1)
; dq log10(1.12527472527472530), log10(1), log10(1.14031180400890860), log10(1)
; dq log10(1.15640880858272150), log10(1), log10(1.17162471395881010), log10(1)
; dq log10(1.18724637681159420), log10(1), log10(1.20329024676850760), log10(1)
; dq log10(1.21904761904761890), log10(1), log10(1.23447860156720910), log10(1)
; dq log10(1.25030525030525030), log10(1), log10(1.26576019777503080), log10(1)
; dq log10(1.28160200250312890), log10(1), log10(1.29702343255224830), log10(1)
; dq log10(1.31282051282051280), log10(1), log10(1.32814526588845650), log10(1)
; dq log10(1.34383202099737530), log10(1), log10(1.35899137358991370), log10(1)
; dq log10(1.37541974479516460), log10(1), log10(1.39035980991174470), log10(1)
; dq log10(1.40659340659340670), log10(1), log10(1.42222222222222210), log10(1)
; dq log10(1.43719298245614050), log10(1), log10(1.45351312987934710), log10(1)
; dq log10(1.46915351506456250), log10(1), log10(1.48405797101449280), log10(1)
; dq log10(1.50036630036630040), log10(1), log10(1.51591413767579560), log10(1)
; dq log10(1.53178758414360510), log10(1), log10(1.54682779456193350), log10(1)
; dq log10(1.56216628527841350), log10(1), log10(1.57781201848998460), log10(1)
; dq log10(1.59377431906614800), log10(1), log10(1.60879811468970920), log10(1)
; dq log10(1.62539682539682540), log10(1), log10(1.64102564102564120), log10(1)
; dq log10(1.65561843168957170), log10(1), log10(1.67183673469387760), log10(1)
; dq log10(1.68698517298187810), log10(1), log10(1.70382695507487520), log10(1)
; dq log10(1.71812080536912770), log10(1), log10(1.73412362404741740), log10(1)
; dq log10(1.75042735042735040), log10(1), log10(1.76551724137931030), log10(1)
; dq log10(1.78086956521739140), log10(1), log10(1.79649122807017550), log10(1)
; dq log10(1.81238938053097340), log10(1), log10(1.82857142857142850), log10(1)
; dq log10(1.84338433843384330), log10(1), log10(1.86012715712988190), log10(1)
; dq log10(1.87545787545787520), log10(1), log10(1.89104339796860570), log10(1)
; dq log10(1.90689013035381750), log10(1), log10(1.92120075046904320), log10(1)
; dq log10(1.93755912961210970), log10(1), log10(1.95233555767397520), log10(1)
; dq log10(1.96923076923076930), log10(1), log10(1.98449612403100770), log10(1)
; dq log10(2), log10(1)
But how values such as 1.01587301587301580, 1.87545787545787520 and 1.98449612403100770 were generated, I have no idea yet.
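One observation that may help: the sample of listed values checked below all have the form 2048/m for an integer m (for example 1.01587301587301580 = 2048/2016 = 64/63), so their reciprocals m/2048 have only 11 fractional bits. That would fit the usual table-driven scheme log10(x) = log10(x*r) - log10(r) with r a short approximation of 1/x, but whether the original source built the table that way is an assumption. The Python sketch checks a handful of entries; only this subset was verified.

```python
# A subset of the x values listed in the table comments.
SAMPLE = [1.01587301587301580, 1.03121852970795570, 1.15640880858272150,
          1.50036630036630040, 1.87545787545787520, 1.98449612403100770, 2.0]

def as_ratio(v: float, numerator: int = 2048):
    """Return (m, distance) where m is the integer nearest to numerator / v."""
    m = numerator / v
    return round(m), abs(m - round(m))

ratios = [as_ratio(v) for v in SAMPLE]
for v, (m, dist) in zip(SAMPLE, ratios):
    print(f"{v:.17f} ~ 2048/{m}  (distance from integer: {dist:.2e})")
```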
What about checking CPUID for SSE3 and having a version that uses the horizontal add, HADDPD, instead of
movupd | shufps addps
Would that be a few cycles faster?
Quote from: daydreamer on August 19, 2020, 09:51:59 AM
what about you check cpuid for SSE3 is on cpu and have a version with horizontal add,HADDPD instead of
movupd | shufps addps
would that be few cycles faster?
Hi DayDreamer. It would probably be faster, but I have no way to assemble SSE3 or SSE4 code. I haven't had time to update RosAsm to work with those yet. That's why I'm using SSE2 and trying to get the maximum speed out of it (this could also help others who don't have newer processors).
Quote from: guga on August 19, 2020, 11:34:08 AM
Hi DayDreamer. probably it would be faster, but i don´t have how to assemble in SSE3 or SSE4. I didn´t have time to updated RosAsm to work with those yet. That´s why i´m using SSE2 trying to get the maximum speed of it (also could help others that don´t have newer processors)
How are the macro capabilities of RosAsm? Is it possible to make SSE3 and SSE4 macros?
http://www.masmforum.com/board/index.php?topic=973.0
Quote from: daydreamer on August 20, 2020, 04:47:00 AM
hows the macro caps for rosasm?possible to make SSE3 and SSE4 macros?
http://www.masmforum.com/board/index.php?topic=973.0 (http://www.masmforum.com/board/index.php?topic=973.0)
You mean using something like this ?
I'm having a little trouble including these macros.
This one doesn't look right:
Code:
ORPD MACRO M1,M2
DB 066H
ORPS MACRO M1,M2
ENDM
After fixing that, it hates these two:
Code:
CMPLTSD MACRO M1,M2
DB 0F2H
CMPLTPS M1,M2
END
and
Code:
CMPSD MACRO M1,M2,M3
DB 0F2H
CMPPS M1,M2,M3
ENDM
Never tried creating a pseudo-instruction as a macro by hand before.
In RosAsm we could hard-code it with something like: DB 01 025 070
The problem is that it would be extremely hard to follow. It would be better to implement SSE3 and SSE4 in RosAsm. I have to do it eventually. I just can't implement it right now due to lack of time and several things in RosAsm that still need fixing before SSE3/SSE4. I still need to find some time to detach RosAsm's internal code and create DLLs for the main tools, such as the disassembler, debugger, resource editor, form creator and even the encoder. All of these would be better off in their own DLLs rather than in a monolithic source as it is now.
Macros in RosAsm work inside "[" and "]". The 1st bracket must be immediately followed by a separator "|". Like this:
[HIWORD | mov eax #1 | shr eax 16]
or the normal If Chain.
[If | cmp #1 #3 | jn#2 I1>]
[Else_if | jmp I9> | I1: | cmp #1 #3 | jn#2 I1>]
[Else | Jmp I9> | I1:]
[End_if | I1: | I9:]
Of course, this is not rigid syntax. It´s just the default macro set where the user can choose to use it or not or even write his own macro set.
If you need a fast SSE2 horizontal addition for 4 packed floats,
movaps xmm1,xmm0
shufps xmm1,xmm0,10110001b
addps xmm0,xmm1
movhlps xmm1,xmm0
addss xmm0,xmm1
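To see what each of the five instructions above does to the lanes, here is a small Python emulation, with a register modelled as a list of four values (index 0 is the lowest lane). The helper names mirror the mnemonics but are plain Python written for illustration; Python floats are doubles, so this only traces the data movement, not single-precision rounding.

```python
def shufps(dst, src, imm8):
    """Low two result lanes are picked from dst, high two from src (2 selector bits each)."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [dst[sel[0]], dst[sel[1]], src[sel[2]], src[sel[3]]]

def addps(a, b):
    """Add all four lanes."""
    return [x + y for x, y in zip(a, b)]

def movhlps(dst, src):
    """Copy the high two lanes of src into the low two lanes of dst."""
    return [src[2], src[3], dst[2], dst[3]]

def addss(a, b):
    """Add the lowest lane only; the other lanes of a are kept."""
    return [a[0] + b[0], a[1], a[2], a[3]]

def horizontal_sum(xmm0):
    xmm1 = list(xmm0)                        # movaps xmm1, xmm0
    xmm1 = shufps(xmm1, xmm0, 0b10110001)    # (a1, a0, a3, a2)
    xmm0 = addps(xmm0, xmm1)                 # (a0+a1, a0+a1, a2+a3, a2+a3)
    xmm1 = movhlps(xmm1, xmm0)               # low lanes = a2+a3
    xmm0 = addss(xmm0, xmm1)                 # lane 0 = a0+a1+a2+a3
    return xmm0[0]

total = horizontal_sum([1.0, 2.0, 4.0, 8.0])
print(total)
```

Note that the sequence leaves the full sum in the lowest lane, whereas HADDPS produces pairwise sums, so a macro emulating HADDPS would need a different lane layout.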
Quote from: Siekmanski on August 20, 2020, 08:13:58 AM
If you need a fast SSE2 horizontal addition for 4 packed floats,
movaps xmm1,xmm0
shufps xmm1,xmm0,10110001b
addps xmm0,xmm1
movhlps xmm1,xmm0
addss xmm0,xmm1
Great. Many thanks marinus.
This could be an easy and very useful way to create a macro that simulates the haddps opcode :thumbsup: :thumbsup: :thumbsup: :thumbsup: :thumbsup:
Quote from: guga on August 20, 2020, 08:28:46 AM
Great. Many thanks marinus.
This could be easy and very usefull to create a macro to simulate the haddps opcodes :thumbsup: :thumbsup: :thumbsup: :thumbsup: :thumbsup:
:thumbsup: :thumbsup:
To replace the mnemonic HADDPS with a macro of the same name in MASM, just use OPTION NOKEYWORD at the very beginning of the source file, listing the mnemonics you want to reprogram as macros.
And now I will probably get flamed by bare-metal coders for this heresy :bgrin:
Hi Daydreamer :bgrin: :bgrin: :bgrin:
I'll probably do this for RosAsm. A set of SSE2 macros that simulate the behaviour of SSE3 and SSE4 is needed for Masm and RosAsm, especially when we want to create code that adapts to the user's processor (or, in my case, where I haven't built the SSE3/SSE4 opcodes yet).
These could be very handy :azn: :azn: :azn:
Hi Guga & friends,
Can I have some timings, please? This is work in progress, building requires MasmBasic version 2 September 2020 (http://masm32.com/board/index.php?topic=94.0). The core is here:
FastMath FastLog10 ; define a math function
For_ fct=0.0 To 10.0 Step 0.5
fld fct ; X
fstp REAL10 ptr [edi]
void Log10(fct) ; Y (built-in MasmBasic function)
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
FastMath ; -------- done -------------
Usage in the attached source (*.asc opens in RichMasm, WordPad, MS Word):
TestC proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
void FastLog10(MyExpo) ; <<<<<<< put result in ST(0)
fstp res8
dec ebx
.Until Sign?
ret
TestC endp
Timings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2244 µs for initialising FastLog10
3170 cycles for 100 * Log10 (Guga)
4783 cycles for 100 * MasmBasic Log10
1553 cycles for 100 * FastMath Log10
3148 cycles for 100 * Log10 (Guga)
4795 cycles for 100 * MasmBasic Log10
1554 cycles for 100 * FastMath Log10
3173 cycles for 100 * Log10 (Guga)
4788 cycles for 100 * MasmBasic Log10
1552 cycles for 100 * FastMath Log10
3319 cycles for 100 * Log10 (Guga)
4781 cycles for 100 * MasmBasic Log10
1552 cycles for 100 * FastMath Log10
3175 cycles for 100 * Log10 (Guga)
4786 cycles for 100 * MasmBasic Log10
1540 cycles for 100 * FastMath Log10
536 bytes for Log10 (Guga)
16 bytes for MasmBasic Log10
162 bytes for FastMath Log10
Real8 0.6989700043360187465
Real8 0.6989700043360188575
Real8 0.6989700043360188575
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
28 µs for initialising FastLog10
3470 cycles for 100 * Log10 (Guga)
5838 cycles for 100 * MasmBasic Log10
1912 cycles for 100 * FastMath Log10
3473 cycles for 100 * Log10 (Guga)
5838 cycles for 100 * MasmBasic Log10
1907 cycles for 100 * FastMath Log10
3484 cycles for 100 * Log10 (Guga)
5833 cycles for 100 * MasmBasic Log10
1916 cycles for 100 * FastMath Log10
3481 cycles for 100 * Log10 (Guga)
5834 cycles for 100 * MasmBasic Log10
1913 cycles for 100 * FastMath Log10
3477 cycles for 100 * Log10 (Guga)
5838 cycles for 100 * MasmBasic Log10
1913 cycles for 100 * FastMath Log10
536 bytes for Log10 (Guga)
16 bytes for MasmBasic Log10
162 bytes for FastMath Log10
Real8 0.6989700043360187465
Real8 0.6989700043360188575
Real8 0.6989700043360188575
--- ok ---
New version, I had forgotten to switch off the range checks:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
88 µs for initialising FastLog10
2882 cycles for 100 * Log10 (Guga)
4780 cycles for 100 * MasmBasic Log10
1399 cycles for 100 * FastMath Log10
2918 cycles for 100 * Log10 (Guga)
5372 cycles for 100 * MasmBasic Log10
1502 cycles for 100 * FastMath Log10
2912 cycles for 100 * Log10 (Guga)
4787 cycles for 100 * MasmBasic Log10
1397 cycles for 100 * FastMath Log10
2882 cycles for 100 * Log10 (Guga)
4774 cycles for 100 * MasmBasic Log10
1387 cycles for 100 * FastMath Log10
2881 cycles for 100 * Log10 (Guga)
4774 cycles for 100 * MasmBasic Log10
1392 cycles for 100 * FastMath Log10
516 bytes for Log10 (Guga)
16 bytes for MasmBasic Log10
67 bytes for FastMath Log10
Real8 0.6989700043360187465
Real8 0.6989700043360188575
Real8 0.6989700043360188575
Guga's version is g from this post (http://masm32.com/board/index.php?topic=8744.msg95585#msg95585).
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
78 µs for initialising FastLog10
3142 cycles for 100 * Log10 (Guga)
5851 cycles for 100 * MasmBasic Log10
1698 cycles for 100 * FastMath Log10
3144 cycles for 100 * Log10 (Guga)
5839 cycles for 100 * MasmBasic Log10
1694 cycles for 100 * FastMath Log10
3154 cycles for 100 * Log10 (Guga)
5834 cycles for 100 * MasmBasic Log10
1693 cycles for 100 * FastMath Log10
3142 cycles for 100 * Log10 (Guga)
5838 cycles for 100 * MasmBasic Log10
1695 cycles for 100 * FastMath Log10
3150 cycles for 100 * Log10 (Guga)
5836 cycles for 100 * MasmBasic Log10
1702 cycles for 100 * FastMath Log10
516 bytes for Log10 (Guga)
16 bytes for MasmBasic Log10
67 bytes for FastMath Log10
Real8 0.6989700043360187465
Real8 0.6989700043360188575
Real8 0.6989700043360188575
--- ok ---
Here is another one, with a FastSqrt added:
FastMath FastSqrt ; define a math function
For_ fct=0.0 To 10.0 Step 0.5
fld fct ; X
fld st
fstp REAL10 ptr [edi]
fsqrt ; Y
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
FastMath ; -------- done -------------
The speed gain is very modest, though:
1859 cycles for 100 * fsqrt
1398 cycles for 100 * FastSqrt
The tangent is more impressive:
10920 cycles for 100 * fptan
1390 cycles for 100 * FastTan
Hello,
I was trying to run your exe, and got this answer:
Quote
This file content an indesirable Virus and will be deleted soon
and the file is deleted.
With what did you compile your files ?
With MASM (ML.exe), but it 'compiles' also with UAsm and AsmC. Quality AV products at VirusTotal say it's clean (https://www.virustotal.com/gui/file/328a31d9191bb1123660f0f722eaf5868690c58df925d818f15e48993f21d425/detection).
I suggest you move your crappy AV software into the recycle bin (http://masm32.com/board/index.php?board=23.0). Is it a French product? "This file content an indesirable Virus" should be contains :cool:
Quote
I suggest you move your crappy AV software into the recycle bin.
You are surely speaking of Windows 10 Home edition, on a perfectly standard PC (4 GB memory, 1 TB disk)
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
44 µs for initialising FastLog10
3175 cycles for 100 * Log10 (Guga)
5845 cycles for 100 * MasmBasic Log10
1693 cycles for 100 * FastMath Log10
1545 cycles for 100 * fsqrt
1698 cycles for 100 * FastSqrt
3149 cycles for 100 * Log10 (Guga)
5838 cycles for 100 * MasmBasic Log10
1692 cycles for 100 * FastMath Log10
1545 cycles for 100 * fsqrt
1690 cycles for 100 * FastSqrt
3147 cycles for 100 * Log10 (Guga)
5836 cycles for 100 * MasmBasic Log10
1690 cycles for 100 * FastMath Log10
1549 cycles for 100 * fsqrt
1692 cycles for 100 * FastSqrt
3142 cycles for 100 * Log10 (Guga)
5840 cycles for 100 * MasmBasic Log10
1696 cycles for 100 * FastMath Log10
1542 cycles for 100 * fsqrt
1695 cycles for 100 * FastSqrt
3140 cycles for 100 * Log10 (Guga)
5841 cycles for 100 * MasmBasic Log10
1692 cycles for 100 * FastMath Log10
1544 cycles for 100 * fsqrt
1691 cycles for 100 * FastSqrt
516 bytes for Log10 (Guga)
16 bytes for MasmBasic Log10
67 bytes for FastMath Log10
14 bytes for fsqrt
67 bytes for FastSqrt
Real8 0.6989700043360187465
Real8 0.6989700043360188575
Real8 0.6989700043360188575
Real8 2.236067977499789805
Real8 2.236067977499789805
--- ok ---
Quote
Intel(R) Core(TM) i5-9400H CPU @ 2.50GHz (SSE4)
31 µs for initialising FastLog10
1622 cycles for 100 * Log10 (Guga)
3402 cycles for 100 * MasmBasic Log10
670 cycles for 100 * FastMath Log10
351 cycles for 100 * fsqrt
669 cycles for 100 * FastSqrt
1637 cycles for 100 * Log10 (Guga)
3474 cycles for 100 * MasmBasic Log10
669 cycles for 100 * FastMath Log10
328 cycles for 100 * fsqrt
667 cycles for 100 * FastSqrt
1583 cycles for 100 * Log10 (Guga)
3785 cycles for 100 * MasmBasic Log10
653 cycles for 100 * FastMath Log10
339 cycles for 100 * fsqrt
663 cycles for 100 * FastSqrt
1593 cycles for 100 * Log10 (Guga)
3320 cycles for 100 * MasmBasic Log10
653 cycles for 100 * FastMath Log10
341 cycles for 100 * fsqrt
663 cycles for 100 * FastSqrt
1590 cycles for 100 * Log10 (Guga)
3434 cycles for 100 * MasmBasic Log10
667 cycles for 100 * FastMath Log10
340 cycles for 100 * fsqrt
664 cycles for 100 * FastSqrt
516 bytes for Log10 (Guga)
16 bytes for MasmBasic Log10
67 bytes for FastMath Log10
14 bytes for fsqrt
67 bytes for FastSqrt
Real8 0.6989700043360187465
Real8 0.6989700043360188575
Real8 0.6989700043360188575
Real8 2.236067977499789805
Real8 2.236067977499789805
--- ok ---
Siekmanski & six_L, your built-in fsqrt is faster than the police allowed :cool:
FastMath shines for more complex functions, see this post (http://masm32.com/board/index.php?topic=8779.0) where it replaces the GSL Bessel function (https://www.gnu.org/software/gsl/doc/html/usage.html). Speed gain is roughly a factor 55 on my trusty old Core i5 :tongue:
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
36 µs for initialising FastLog10
2396 cycles for 100 * Log10 (Guga)
6316 cycles for 100 * MasmBasic Log10
1584 cycles for 100 * FastMath Log10
690 cycles for 100 * fsqrt
1518 cycles for 100 * FastSqrt
2418 cycles for 100 * Log10 (Guga)
6323 cycles for 100 * MasmBasic Log10
1542 cycles for 100 * FastMath Log10
699 cycles for 100 * fsqrt
1585 cycles for 100 * FastSqrt
2431 cycles for 100 * Log10 (Guga)
6362 cycles for 100 * MasmBasic Log10
1523 cycles for 100 * FastMath Log10
660 cycles for 100 * fsqrt
1598 cycles for 100 * FastSqrt
2552 cycles for 100 * Log10 (Guga)
6347 cycles for 100 * MasmBasic Log10
1523 cycles for 100 * FastMath Log10
660 cycles for 100 * fsqrt
1547 cycles for 100 * FastSqrt
2466 cycles for 100 * Log10 (Guga)
6344 cycles for 100 * MasmBasic Log10
1520 cycles for 100 * FastMath Log10
660 cycles for 100 * fsqrt
1518 cycles for 100 * FastSqrt
516 bytes for Log10 (Guga)
16 bytes for MasmBasic Log10
67 bytes for FastMath Log10
14 bytes for fsqrt
67 bytes for FastSqrt
Real8 0.6989700043360187465
Real8 0.6989700043360188575
Real8 0.6989700043360188575
Real8 2.236067977499789805
Real8 2.236067977499789805
-
340 cycles for 100 * fsqrt Intel(R) Core(TM) i5-9400H (six_L)
660 cycles for 100 * fsqrt AMD Ryzen 5 3400G (Timo)
1545 cycles for 100 * fsqrt Intel(R) Core(TM) i7-4930K (Siekmanski)
1859 cycles for 100 * fsqrt Intel(R) Core(TM) i5-2450M (jj2007)
8 years of progress... but why did they pick fsqrt? The other timings are not so different :rolleyes:
Quote from: TouEnMasm on September 03, 2020, 01:54:40 AM
Quote
I suggest you move your crappy AV software into the recycle bin.
You surely speak of Windows 10 Familial edition ,on a perfectly standard UC (4Go Mem,1 tera disk)
Hi ToutEnMasm,
Could you try excluding the false positives in your antivirus settings?
Any other candidates, i.e. slow math functions that deserve a speed boost?