The MASM Forum

General => The Laboratory => Topic started by: guga on August 16, 2020, 06:01:01 AM

Title: Fast Log10 approximation
Post by: guga on August 16, 2020, 06:01:01 AM

Hi Guys

Another test. This time it is a fast Log10 function, built the same way as the fast exp we are testing in this thread: http://masm32.com/board/index.php?topic=8734.0

I managed to assemble the file with MasmBasic (it has small errors, but it is working as expected :mrgreen:). The function calculates the Log10 of a number with a precision of 16 digits after the decimal point.

It also checks for NAN, zero, and negative and positive infinity, and it can calculate denormalized values.

The resulting value is stored in xmm0. (This time I succeeded in making it use fewer xmm registers :thumbsup: :thumbsup: :thumbsup: )

Important: If you assemble the function on its own, it is mandatory to align the data on a 16-byte boundary, because the xmm instructions access it with 128-bit memory operands. (I used memory operands to avoid using so many xmm registers.)

Btw, please don't mind the labels in the console output. I don't know how to configure MasmBasic to output the proper ones :bgrin: :bgrin: :bgrin:

Original value of Log10(5)
Log10(5) = 0.698970004336018804786261105275506973231810118537891458689
My version:
Log10(5) = 0.698970004336018857
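
To put the two numbers above side by side, here is a minimal Python sketch that subtracts the reference value from the posted result using the `decimal` module. The variable names are illustrative only; the two constants are copied from the lines above, and the comparison against 2^-53 (the spacing of doubles between 0.5 and 1) is just one convenient yardstick.

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # enough digits to hold the long reference value

# Both constants are copied verbatim from the post above.
reference = Decimal("0.698970004336018804786261105275506973231810118537891458689")
posted = Decimal("0.698970004336018857")

error = posted - reference
ulp = Decimal(2) ** -53  # spacing of doubles in [0.5, 1)

print("absolute error:", error)
print("error in ulps :", error / ulp)
```

If the difference stays below one ulp, the posted value is as close to the reference as a Real8 result can reasonably be expected to print.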

On my PC, the average speed is around 2500 cycles, but again, we can make it even faster with some extra effort. I can make it run in around 1600 cycles
(without losing precision) simply by removing the check for NANs, that is, by commenting out these lines:

Code:



                cmp     ecx, 2045 ; Check for NAN, Infinite, Zero and Denormalized values
                jbe     loc_4C6BFC
(...)

loc_4C6BFC:

But if we do so, we won't be able to check for errors in the input number, nor will we be able to calculate the log10 of denormalized values.
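
How the single `cmp ecx, 2045` can catch all the special inputs at once is easier to see in a small Python model. This is only a sketch based on the listing further down (shift the bits right by 52, mask with 0FFF, subtract 1, compare unsigned); the function name `is_special_case` is made up for the illustration.

```python
import struct

def is_special_case(x: float) -> bool:
    """Model of: psrlq 52 / and ecx 0FFF / sub ecx 1 / cmp ecx 2045 (unsigned)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    top12 = (bits >> 52) & 0xFFF        # sign bit + 11-bit biased exponent
    ecx = (top12 - 1) & 0xFFFFFFFF      # 32-bit wrap-around, like `sub ecx 1`
    # exponent 0 (zero, denormal) wraps to 0FFFFFFFF, exponent 2047 (Inf, NAN)
    # gives 2046, and a set sign bit gives at least 2047: all are above 2045.
    return ecx > 2045

for value in (5.0, 0.0, 5e-324, float("inf"), float("nan"), -1.0):
    print(value, is_special_case(value))
```

Removing the compare therefore removes the zero, negative, denormal, infinity and NAN handling together, which matches the trade-off described above.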

The parameters are the same as in the exp function:

1st Parameter = The value to be calculated. It can be an integer, a Float or the address of a Real8 (or Qword)
2nd Parameter = The flag that tells the function which type of input is used. I built a set of equates for that.
SSE_EXP_INT equ 1
SSE_EXP_FLOAT equ 2
SSE_EXP_REAL8 equ 4
SSE_EXP_QWORD equ 8

So, if you want to calculate the log10 of 5 and the input is an integer, you must use

invoke Sse2_log10_precise, 5, 1
or
invoke Sse2_log10_precise, 5, SSE_EXP_INT

If the input is a Float (Real4) you use:

MyValue Real4 5.0 ; I believe this is the masm syntax for Float, right ?

invoke Sse2_log10_precise, [MyValue], 2

or
invoke Sse2_log10_precise, [MyValue], SSE_EXP_FLOAT

If the input is a Real8, you must pass the address (offset) of the value as the input, like this:

MyValue Real8 5.0
invoke Sse2_log10_precise, offset MyValue, 4

or

MyValue Real8 5.0
invoke Sse2_log10_precise, offset MyValue, SSE_EXP_REAL8

The same applies to a QWord (int64) input, but using 8 as the flag (2nd parameter) and the pointer to the int64 value as the 1st parameter.

Ex:
push SSE_EXP_REAL8 ; 2nd parameter: the flag (4), because the input is a Real8 as I explained above
push offset MyExpo ; 1st parameter: pointer to a Real8 or to a Qword, only when we are using 64-bit values. For 32-bit values (dword, Float etc),
; we use the value directly because an int or a Float is not a pointer. (So, without the "offset" thing in masm)
call Sse2_log10_precise

RosAsm version is as follows:

Code:


[Float_Zero: R$ 0] ; the procedure below refers to this label as X$Float_Zero
[SSE_Two52: Q$ 4841369599423283200, 4841369599423283200] ; = 2^52 ;D$ 0, 043300000, 0, 043300000]
[SSE_One: R$ 1, R$ 1]
[SSE_LOG_EMASK < 2.22507385850720082e-308 >]; Using Body Equate to set the proper value which is Closer to DBL_MIN that is: 2.2250738585072014e-308
[SSE_LOG_HIMASK < 1.79769313486231571e+308 >]; Using Body Equate to set the proper value which is Closer to DBL_MAX that is: 1.7976931348623158e+308
[SSE_Emask: R$ SSE_LOG_EMASK, SSE_LOG_EMASK] ; D$ 0-1, 0FFFFF,  0-1, 0FFFFF
[<16 SSE_LOG10_CC_0: R$ (111/256), R$ (111/256)] ; L10EA
[<16 SSE_Magic0: R$ 4.39804651110300781e+12, R$ 4.39804651110300781e+12]
[SSE_LOG10_HIMASK < 07FFFFFFF__80000000 >]; Using Body Equate. Bit mask that keeps the exponent and the upper mantissa bits (it clears the sign bit and the low 31 bits)
[SSE_HiMask0: Q$ SSE_LOG10_HIMASK, SSE_LOG10_HIMASK]
[<16 SSE_Place_Log2: D$ 0, 0
                     D$ 0FFFFFFFF, 0FFFFFFFF,
                     D$ 0FFFFFFFF, 0FFFFFFFF,
                     D$ 0, 0]
[<16 SSE_Log1020: R$ 3.01029995663952832e-1, R$ 2.83633945510449641e-14] ; Log10(2) = 0.3010299956639811952137388947244930267681898814621085413104274611
[<16 Log_Coeff0: R$ 21.5354732628465832, R$ -3.07179525615370474]
[<16 SSE_Log10Var1: R$ -10.8935578527763628, R$ 1.77588163534834509]
[<16 SSE_Log10Var2: R$ 5.66760060334353621, R$ -1.15501676674018694]
[<16 SSE_Log10Var3: R$ 1.61610240749971053e-3, R$ 0]

; most likely a table of Log10(x) for x from 1 to 2 (values from 0 up to Log10(2)), each entry followed by a small correction term
[<16 Log10_Table_T: R$ 0,                        R$ 0,                        R$ 6.83942453019881210e-003, R$ 1.06653844666057790e-013
                    R$ 1.33507081443440260e-002, R$ 8.67505934719954320e-014, R$ 1.99611018521181900e-002, R$ 9.23215523723840900e-014
                    R$ 2.62229227369061850e-002, R$ 7.49930509874774870e-014, R$ 3.25763513509400580e-002, R$ 2.41276323350416150e-014
                    R$ 3.90241079016959700e-002, R$ 1.07524865705891230e-014, R$ 4.50982556138797010e-002, R$ 2.01960064510954180e-014
                    R$ 5.12585643186866950e-002, R$ 3.16576133915421310e-014, R$ 5.70236199724831750e-002, R$ 2.44070710374291600e-014
                    R$ 6.31113911136935710e-002, R$ 2.48257788963393190e-014, R$ 6.87885240052992230e-002, R$ 1.09693456491489220e-013
                    R$ 7.45408528944153660e-002, R$ 8.48557345140544200e-014, R$ 8.03703965551676450e-002, R$ 5.64317690419344980e-014
                    R$ 8.60206705779091860e-002, R$ 2.11074530681352090e-014, R$ 9.14835662794075690e-002, R$ 2.48825492648481080e-014
                    R$ 9.70160548793046470e-002, R$ 8.88302993483350900e-014, R$ 1.02351435027458140e-001, R$ 8.15103304243699220e-014
                    R$ 1.07753177325776050e-001, R$ 4.45086172353454940e-014, R$ 1.12947822295495830e-001, R$ 3.08831178628687270e-015
                    R$ 1.18205353949292660e-001, R$ 3.88929326942825740e-014, R$ 1.23245578588807800e-001, R$ 4.71750903210203580e-014
                    R$ 1.28344985300145710e-001, R$ 6.57467750563177680e-014, R$ 1.33216699989134210e-001, R$ 2.71250821687622270e-014
                    R$ 1.38435254551609430e-001, R$ 7.56877432679591610e-015, R$ 1.43127205461155430e-001, R$ 6.81055570899612790e-015
                    R$ 1.48168577326714510e-001, R$ 6.02542674600232690e-014, R$ 1.52967460208515150e-001, R$ 2.83427723708756480e-014
                    R$ 1.57515087959154700e-001, R$ 1.09440764415528030e-013, R$ 1.62418959194383210e-001, R$ 5.35145729088875760e-014
                    R$ 1.67067178541742580e-001, R$ 5.99509971730059460e-014, R$ 1.71450865902443180e-001, R$ 1.13452388582535560e-013
                    R$ 1.76197300927015020e-001, R$ 3.28390775041250930e-015, R$ 1.80674603281659070e-001, R$ 1.03481913998834240e-013
                    R$ 1.85198545041771470e-001, R$ 3.73116362151814480e-014, R$ 1.89441967200082220e-001, R$ 2.98008559425539880e-014
                    R$ 1.93727260613627550e-001, R$ 8.13197508503595880e-014, R$ 1.98055259839406970e-001, R$ 3.57492486261199750e-014
                    R$ 2.02426824636404490e-001, R$ 7.53155364447138070e-014, R$ 2.06501548650066980e-001, R$ 7.07722033818675070e-014
                    R$ 2.10959407186123830e-001, R$ 1.06420585148621610e-013, R$ 2.15115366957320480e-001, R$ 6.74895867627637050e-014
                    R$ 2.18960252674605730e-001, R$ 6.67672190370376170e-014, R$ 2.23193863603228240e-001, R$ 1.36388742479585940e-014
                    R$ 2.27111265564531100e-001, R$ 2.32760258694916680e-014, R$ 2.31425484637043160e-001, R$ 2.92761620518569070e-014
                    R$ 2.35053696899512940e-001, R$ 6.25806885708624120e-014, R$ 2.39080054690248290e-001, R$ 3.00590055364650880e-014
                    R$ 2.43144090557620980e-001, R$ 1.05192270531529520e-014, R$ 2.46871963076841890e-001, R$ 3.27758102924169520e-014
                    R$ 2.50632111950153560e-001, R$ 2.79059969846129800e-014, R$ 2.54425100967296200e-001, R$ 2.43505807623931280e-014
                    R$ 2.58251508820308120e-001, R$ 6.53068360680055810e-014, R$ 2.62111929633533690e-001, R$ 7.78445532603097920e-014
                    R$ 2.65615893362905810e-001, R$ 1.97240087303329050e-014, R$ 2.69542633331980140e-001, R$ 6.12305485363238830e-014
                    R$ 2.73107313935042840e-001, R$ 3.18805618188936470e-014, R$ 2.76701495678366880e-001, R$ 1.05904780751833190e-013
                    R$ 2.80325670940214880e-001, R$ 4.14680805130042090e-014, R$ 2.83572747613220600e-001, R$ 1.90891161216058830e-014
                    R$ 2.87254964996350280e-001, R$ 1.65981890215736650e-014, R$ 2.90554464110186930e-001, R$ 4.83600159236775040e-014
                    R$ 2.94296613004917160e-001, R$ 9.56300328864571600e-014, R$ 2.97650255012513300e-001, R$ 8.72995849029738040e-014
                    R$ 3.01029995663952830e-001, R$ 2.83633945510449640e-014

Log10_Table_B: R$ (227328/524288), R$ (227328/524288), R$ (223776/524288), R$ (223776/524288)
               R$ (220446/524288), R$ (220446/524288), R$ (217116/524288), R$ (217116/524288)
               R$ (214008/524288), R$ (214008/524288), R$ (210900/524288), R$ (210900/524288)
               R$ (207792/524288), R$ (207792/524288), R$ (204906/524288), R$ (204906/524288)
               R$ (202020/524288), R$ (202020/524288), R$ (199356/524288), R$ (199356/524288)
               R$ (196581/524288), R$ (196581/524288), R$ (194028/524288), R$ (194028/524288)
               R$ (191475/524288), R$ (191475/524288), R$ (188922/524288), R$ (188922/524288)
               R$ (186480/524288), R$ (186480/524288), R$ (184149/524288), R$ (184149/524288)
               R$ (181818/524288), R$ (181818/524288), R$ (179598/524288), R$ (179598/524288)
               R$ (177378/524288), R$ (177378/524288), R$ (175269/524288), R$ (175269/524288)
               R$ (173160/524288), R$ (173160/524288), R$ (171162/524288), R$ (171162/524288)
               R$ (169164/524288), R$ (169164/524288), R$ (167277/524288), R$ (167277/524288)
               R$ (165279/524288), R$ (165279/524288), R$ (163503/524288), R$ (163503/524288)
               R$ (161616/524288), R$ (161616/524288), R$ (159840/524288), R$ (159840/524288)
               R$ (158175/524288), R$ (158175/524288), R$ (156399/524288), R$ (156399/524288)
               R$ (154734/524288), R$ (154734/524288), R$ (153180/524288), R$ (153180/524288)
               R$ (151515/524288), R$ (151515/524288), R$ (149961/524288), R$ (149961/524288)
               R$ (148407/524288), R$ (148407/524288), R$ (146964/524288), R$ (146964/524288)
               R$ (145521/524288), R$ (145521/524288), R$ (144078/524288), R$ (144078/524288)
               R$ (142635/524288), R$ (142635/524288), R$ (141303/524288), R$ (141303/524288)
               R$ (139860/524288), R$ (139860/524288), R$ (138528/524288), R$ (138528/524288)
               R$ (137307/524288), R$ (137307/524288), R$ (135975/524288), R$ (135975/524288)
               R$ (134754/524288), R$ (134754/524288), R$ (133422/524288), R$ (133422/524288)
               R$ (132312/524288), R$ (132312/524288), R$ (131091/524288), R$ (131091/524288)
               R$ (129870/524288), R$ (129870/524288), R$ (128760/524288), R$ (128760/524288)
               R$ (127650/524288), R$ (127650/524288), R$ (126540/524288), R$ (126540/524288)
               R$ (125430/524288), R$ (125430/524288), R$ (124320/524288), R$ (124320/524288)
               R$ (123321/524288), R$ (123321/524288), R$ (122211/524288), R$ (122211/524288)
               R$ (121212/524288), R$ (121212/524288), R$ (120213/524288), R$ (120213/524288)
               R$ (119214/524288), R$ (119214/524288), R$ (118326/524288), R$ (118326/524288)
               R$ (117327/524288), R$ (117327/524288), R$ (116439/524288), R$ (116439/524288)
               R$ (115440/524288), R$ (115440/524288), R$ (114552/524288), R$ (114552/524288)
               R$ (113664/524288), R$ (113664/524288)]

; Parameters flag
[SSE_EXP_INT 1
 SSE_EXP_FLOAT 2
 SSE_EXP_REAL8 4
 SSE_EXP_QWORD 8]

; Values to return
[SSE_EXP_INVALID_PARAMETER 0-1] ; Invalid flag
[SSE_UNDERFLOW 0-2] ; The input number underflows
[SSE_OVERFLOWN 0-3] ; The input number overflows
[SSE_INFINITE 0-4] ; General error. The input number is infinite, or NAN etc
[SSE_ZERO 0-5] ; The input number is zero. Log and Ln cannot have this
[SSE_NEG_INFINITE 0-6] ; Negative Infinite found
[SSE_POS_INFINITE 0-7] ; Positive Infinite found
[SSE_NAN 0-9] ; NAN. Not a number

Proc Sse2_log10_precise:
    Arguments @Number, @Flag
    Uses edx, ecx

    mov eax D@Flag
    Test_if eax SSE_EXP_INT
        cvtsi2sd xmm0 D@Number ; converts a signed integer to double
    Test_Else_if eax SSE_EXP_FLOAT
        cvtss2sd xmm0 X@Number ; converts a single precision float to double
    Test_Else_if eax SSE_EXP_REAL8
        mov eax D@Number | movsd XMM0 X$eax
    Test_Else_if eax SSE_EXP_QWORD
        mov eax D@Number | movq XMM0 X$eax
    Test_Else
        xor eax eax | ExitP ; return 0 Invalid parameter
    Test_End

    xor edx edx
    movupd XMM1 XMM0
    unpcklpd XMM0 XMM0
    psrlq XMM1 52 | pextrw ecx XMM1 0 | and ecx 0FFF | sub ecx 1

    ...If ecx > 2045 ; Special cases: zero, negative, denormalized, infinite or NAN

        .SSE_D_If xmm0 <= X$Float_Zero ; Input value is zero or negative
            mov eax SSE_ZERO
            SSE_D_If xmm0 < X$Float_Zero ; Input value is negative
                mov eax SSE_NEG_INFINITE
            SSE_D_End
            ExitP ; Exit the function
        .SSE_D_Else
            ..If ecx = 0-1  ; number is denormalized. We can continue
                mulsd XMM0 X$SSE_Two52 ; scale very tiny numbers up by 2^52. Ex: x = 2e-314
                mov edx 0-52
                xor eax eax
                movupd XMM1 XMM0
                unpcklpd XMM0 XMM0
                psrlq XMM1 52 | pextrw ecx XMM1 0 | and ecx 0FFF | sub ecx 1
            ..Else
                movupd XMM1 X$SSE_One ; same as SSE_One2
                andpd XMM0 X$SSE_Emask ; same as SSE_Emask2
                orpd XMM0 XMM1
                cmpsd XMM1 XMM0 0 ; (EQ) error in rosasm
                pextrw eax XMM1 0
                If eax = 0 ; Not a Number
                    mov eax SSE_NAN
                Else ; Number is positive infinite
                    mov eax SSE_POS_INFINITE
                End_If
                ExitP ; Exit the function
            ..End_If
        .SSE_D_End

    ...End_If

    movupd XMM1 X$SSE_Magic0 | andpd XMM0 X$SSE_Emask | orpd XMM0 X$SSE_One | addpd XMM1 XMM0
    pextrw eax XMM1 0 | and eax ((Size_of_LogTable*2)-48)

    movupd XMM4 X$Log10_Table_T+eax
    movupd XMM1 X$SSE_HiMask0 | andpd XMM1 XMM0
    subpd XMM0 XMM1 | mulpd XMM0 X$Log10_Table_B+eax
    mulpd XMM1 X$Log10_Table_B+eax | subpd XMM1 X$SSE_LOG10_CC_0
    addsd XMM4 XMM1

    sub ecx 1022 | add ecx edx
    cvtsi2sd XMM2 ecx | shl ecx 10 | add eax ecx | mov ecx 16 | mov edx 0 | cmp eax 0 | cmovz edx ecx

    movupd XMM3 XMM0 | andpd XMM3 X$SSE_Place_Log2+edx
    addpd XMM0 XMM1
     ; same as SSE_Place_Log0
    unpcklpd XMM2 XMM2 | mulpd XMM2 X$SSE_Log1020 | addpd XMM2 XMM3 | addpd XMM4 XMM2

    movupd XMM1 X$Log_Coeff0
    movupd XMM2 XMM0 | mulpd XMM2 XMM2 | mulsd XMM2 XMM2 | mulsd XMM2 XMM0
    mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1

    mulpd XMM1 XMM0
    addpd XMM1 X$SSE_Log10Var2 | mulpd XMM1 XMM2

    movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0

    movupd XMM3 XMM4
    unpckhpd XMM3 XMM3

    movupd XMM0 XMM1
    addpd XMM1 XMM2
    unpckhpd XMM0 XMM0

    addsd XMM0 XMM1
    addsd XMM0 XMM3
    addsd XMM0 XMM4

EndP
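
As far as the listing above can be read, the routine splits off the binary exponent, first scales denormalized inputs up by 2^52, and then adds exponent * Log10(2) to the logarithm of the mantissa. The Python sketch below illustrates only that decomposition. It is an assumption-laden model, not the posted algorithm: `math.log10` stands in for the table lookup plus polynomial, and the function name is hypothetical.

```python
import math

LOG10_2 = math.log10(2.0)
DBL_MIN = 2.2250738585072014e-308  # smallest normalized Real8

def log10_by_exponent_split(x: float) -> float:
    """Sketch: log10(x) = e * log10(2) + log10(m), where x = m * 2**e."""
    if not (x > 0.0) or math.isinf(x):
        raise ValueError("this sketch only handles finite positive inputs")
    shift = 0
    if x < DBL_MIN:            # denormalized: scale up by 2**52 first,
        x *= 2.0 ** 52         # like `mulsd XMM0 X$SSE_Two52`
        shift = -52            # and remember it, like `mov edx 0-52`
    m, e = math.frexp(x)       # x = m * 2**e with 0.5 <= m < 1
    # math.log10(m) is a stand-in for the table + polynomial part.
    return (e + shift) * LOG10_2 + math.log10(m)

for value in (5.0, 0.001, 1e300, 2e-314):
    print(value, log10_by_exponent_split(value), math.log10(value))
```

The posted code keeps Log10(2) as a high part plus a small low part (SSE_Log1020), presumably to reduce the rounding error of the exponent term; the sketch uses a single constant for brevity.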

My timings

Code:



AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

2747    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11641   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27174   cycles for 100 * pow (CRT, 2.7182818^5)

2619    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11399   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27011   cycles for 100 * pow (CRT, 2.7182818^5)

2632    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11207   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25996   cycles for 100 * pow (CRT, 2.7182818^5)

2593    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11419   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26219   cycles for 100 * pow (CRT, 2.7182818^5)

2553    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11310   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26940   cycles for 100 * pow (CRT, 2.7182818^5)

148.413159102577 for Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)

Updated new version with the proper values (between parentheses) calculated. File: TimmingsLog10g.zip (faster and still accurate)

Title: Re: Fast Log10 approximation
Post by: Siekmanski on August 16, 2020, 06:49:08 AM

Code:

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

3398    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12420   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24740   cycles for 100 * pow (CRT, 2.7182818^5)

3412    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12403   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24733   cycles for 100 * pow (CRT, 2.7182818^5)

3402    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12412   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24766   cycles for 100 * pow (CRT, 2.7182818^5)

3401    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12403   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24788   cycles for 100 * pow (CRT, 2.7182818^5)

3411    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12437   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24734   cycles for 100 * pow (CRT, 2.7182818^5)

148.413159102577 for Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)

Title: Re: Fast Log10 approximation
Post by: guga on August 16, 2020, 07:24:05 AM

Hi Marinus

You were right. I'm liking this SSE stuff :bgrin: :bgrin: :bgrin: I must confess, it's a bit complicated at first, but the results are worth the effort :thumbsup: :thumbsup:

I'm trying out those faster functions with a view to building a dll for use with the image processing functions. This could be very useful.

Title: Re: Fast Log10 approximation
Post by: HSE on August 16, 2020, 07:47:28 AM

:thumbsup:

Code:

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

5396    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11737   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5388    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11746   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5387    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11779   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5447    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11731   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5388    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11745   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for SmplMath:  fSlv MyReal8 = log(MyExpo)

Title: Re: Fast Log10 approximation
Post by: Siekmanski on August 16, 2020, 07:48:26 AM

:bgrin:

Title: Re: Fast Log10 approximation
Post by: jj2007 on August 16, 2020, 08:00:03 AM

Does it return the result in xmm0?

call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0

Title: Re: Fast Log10 approximation
Post by: HSE on August 16, 2020, 08:07:03 AM

Apparently this is faster:

Code:

    movdqu oword ptr MyReal8, xmm0

LATER: :biggrin:

Quote from: jj2007 on August 16, 2020, 08:00:03 AM
movlps MyReal8, xmm0

perhaps it is:

Code:

movlpd MyReal8, xmm0

Code:

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

5210    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11760   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5217    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11750   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5213    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11758   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5228    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11796   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5215    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11820   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for SmplMath:  fSlv MyReal8 = log(MyExpo)

Title: Re: Fast Log10 approximation
Post by: guga on August 16, 2020, 08:30:25 AM

Quote from: jj2007 on August 16, 2020, 08:00:03 AM
Does it return the result in xmm0?

call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0

Hi JJ

Yes, the result is in xmm0

If i ported it correctly, it should return this:

(https://i.ibb.co/F3bBsQV/sfd-Image1.png) (https://ibb.co/C85nJj2)

Title: Re: Fast Log10 approximation
Post by: jj2007 on August 16, 2020, 08:38:17 AM

Quote from: guga on August 16, 2020, 08:30:25 AM
Quote from: jj2007 on August 16, 2020, 08:00:03 AM
Does it return the result in xmm0?

call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0
Hi JJ

Yes, the result is in xmm0

If i ported it correctly, it should return this:

Just add the movlps MyReal8, xmm0, and you will see the result at the end.

Title: Re: Fast Log10 approximation
Post by: HSE on August 16, 2020, 08:58:08 AM

movlpd

Title: Re: Fast Log10 approximation
Post by: jj2007 on August 16, 2020, 09:46:48 AM

Quote from: HSE on August 16, 2020, 08:58:08 AM
movlpd

Most if not all sse* mov instructions don't care what format they are moving. I use movlps because it's one byte shorter.

Title: Re: Fast Log10 approximation
Post by: guga on August 16, 2020, 09:52:04 AM

Quote from: jj2007 on August 16, 2020, 08:38:17 AM
Quote from: guga on August 16, 2020, 08:30:25 AM
Quote from: jj2007 on August 16, 2020, 08:00:03 AM
Does it return the result in xmm0?

call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0
Hi JJ

Yes, the result is in xmm0

If i ported it correctly, it should return this:

Just add the movlps MyReal8, xmm0, and you will see the result at the end.

Hi JJ. The value is shown incorrectly. I guess I didn't port it properly.

Most likely the problem is in the data variables.

Log10_Table_T and Log10_Table_B are part of the same table. So, basically, it is an array divided in two, but when I ported it to masm, the values of Log10_Table_B were gone :dazzled: :dazzled: :dazzled:

When I ported it as:

Log10_Table_B dq (227328/524288), (227328/524288), (223776/524288), (223776/524288)
dq (220446/524288), (220446/524288), (217116/524288), (217116/524288)
dq (214008/524288), (214008/524288), (210900/524288), (210900/524288)
dq (207792/524288), (207792/524288), (204906/524288), (204906/524288)
dq (202020/524288), (202020/524288), (199356/524288), (199356/524288)
(...)

all those values between parentheses were zeroed. Masm compiled it as:
Log10_Table_B xmmword 41h dup(0)

Why?

How can I make masm calculate the values of the Qwords, such as 220446/524288? Do I need to remove the parentheses?

Title: Re: Fast Log10 approximation
Post by: HSE on August 16, 2020, 10:19:50 AM

Quote from: jj2007 on August 16, 2020, 09:46:48 AM
Most if not all sse* mov instructions don't care what format they are moving. I use movlps because it's one byte shorter.

Well JJ, you are lucky: your machine is smarter than mine!

movlps :

Code:

9744 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )

movlpd :

Code:

5192 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )

movdqu oword ptr MyReal8, xmm0 :

Code:

5384 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise

Title: Re: Fast Log10 approximation
Post by: jj2007 on August 16, 2020, 10:27:33 AM

That looks odd, can you post the executables? The call to the routine is a hundred times slower than the movlps/movlpd :cool:

Let's test it:

Code:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

25285   cycles for 10000 * movlps
25166   cycles for 10000 * movlpd
25808   cycles for 10000 * movdqu

25155   cycles for 10000 * movlps
25053   cycles for 10000 * movlpd
25257   cycles for 10000 * movdqu

25271   cycles for 10000 * movlps
24811   cycles for 10000 * movlpd
25296   cycles for 10000 * movdqu

25034   cycles for 10000 * movlps
25595   cycles for 10000 * movlpd
25706   cycles for 10000 * movdqu

24902   cycles for 10000 * movlps
25345   cycles for 10000 * movlpd
25546   cycles for 10000 * movdqu

28      bytes for movlps
32      bytes for movlpd
32      bytes for movdqu

R8      1234567890.123456716
R8      1234567890.123456716
R8      1234567890.123456716

Title: Re: Fast Log10 approximation
Post by: HSE on August 16, 2020, 10:31:42 AM

It's the routine with the mov.

The code is above, just change line 345.

Title: Re: Fast Log10 approximation
Post by: jj2007 on August 16, 2020, 10:45:44 AM

Quote from: HSE on August 16, 2020, 10:31:42 AM
It's the routine with the mov.

The code is above, just change line 345.

Please show the line in context, I can't find it.

Title: Re: Fast Log10 approximation
Post by: guga on August 16, 2020, 10:57:53 AM

Guys, the syntax of Table_B is wrong. How can I make masm calculate the values in parentheses in the data section?

Log10_Table_B dq (227328/524288), (227328/524288), <----- this is causing masm to compile it as db 000000000 rather than the values of each division

Title: Re: Fast Log10 approximation
Post by: guga on August 16, 2020, 01:42:48 PM

OK, Guys...Now it is working as expected. The result is ok.

All i did was replace the values of the variables in between ( ) with their calculated values and it returned the correct answer as expected.

Attached update. The src is the same as before, except the fix of the values in parenthesis. Updated the 1st post too

Result is:

Code:


AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

2786    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
11317   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26628   cycles for 100 * pow (CRT, 2.7182818^5)

2662    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
11268   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27312   cycles for 100 * pow (CRT, 2.7182818^5)

2686    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
11511   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26168   cycles for 100 * pow (CRT, 2.7182818^5)

2832    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
12049   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27194   cycles for 100 * pow (CRT, 2.7182818^5)

2781    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
11254   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26377   cycles for 100 * pow (CRT, 2.7182818^5)

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)

--- ok ---
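
The fix described in this post (replacing the divisions in parentheses with their calculated values) can be prepared with a few lines of Python. The sketch rests on the usual explanation for the zeroed table, namely that the assembler evaluates such constant expressions with integer arithmetic, so every numerator smaller than 524288 divides to 0. Only the first four numerators of Log10_Table_B are used here, and the helper name and the `real8` output layout are assumptions for the illustration.

```python
DENOMINATOR = 524288  # 2**19, the common denominator used in Log10_Table_B

# First few numerators from the table in the first post.
numerators = [227328, 223776, 220446, 217116]

def real8_literal(numerator: int, denominator: int = DENOMINATOR) -> str:
    """Return the shortest decimal text that round-trips to the same double."""
    return repr(numerator / denominator)

for n in numerators:
    # Integer division shows what an integer-only expression evaluator yields.
    print(f"({n}/{DENOMINATOR}): integer result = {n // DENOMINATOR}, "
          f"real value = {real8_literal(n)}")

# Each entry is stored twice in the table, so emit it as a pair.
for n in numerators:
    lit = real8_literal(n)
    print(f"real8 {lit}, {lit}")
```

Because the denominator is a power of two, each quotient is exactly representable as a Real8, so the generated literals lose nothing compared with the RosAsm expressions.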

Title: Re: Fast Log10 approximation
Post by: TimoVJL on August 16, 2020, 05:18:16 PM

Code:

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

26512   cycles for 10000 * movlps
26975   cycles for 10000 * movlpd
62428   cycles for 10000 * movdqu

26448   cycles for 10000 * movlps
26523   cycles for 10000 * movlpd
62479   cycles for 10000 * movdqu

26537   cycles for 10000 * movlps
26812   cycles for 10000 * movlpd
62242   cycles for 10000 * movdqu

26604   cycles for 10000 * movlps
26673   cycles for 10000 * movlpd
62401   cycles for 10000 * movdqu

26578   cycles for 10000 * movlps
26707   cycles for 10000 * movlpd
62530   cycles for 10000 * movdqu

28      bytes for movlps
32      bytes for movlpd
32      bytes for movdqu

R8      1234567890.123456716
R8      1234567890.123456716
R8      1234567890.123456716

Title: Re: Fast Log10 approximation
Post by: HSE on August 17, 2020, 02:53:55 AM

Quote from: jj2007 on August 16, 2020, 10:45:44 AM
Please show the line in context, I can't find it.

:biggrin: Do you have problems with the TestBed?
(clarification: the TestBed program was written by jj2007)

Quote from: TimoVJL on August 16, 2020, 05:18:16 PM
Code:
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)

Only SSE3 here! Perhaps that is it.

Quote from: guga on August 16, 2020, 10:57:53 AM
Guys, the syntax of Table_B is wrong. How can I make masm calculate the values in parentheses in the data section?

Log10_Table_B dq (227328/524288), (227328/524288), <----- this is causing masm to compile it as db 000000000 rather than the values of each division

You can use qWord's floating point arithmetic while assembling: MREAL-macros (http://masm32.com/board/index.php?topic=3225.msg33774#msg33774)

Title: Re: Fast Log10 approximation
Post by: jack on August 17, 2020, 03:18:10 AM

From what I gather, you use tables to calculate the logarithm. Just for fun, using Maple I computed a rational approximation to ln(1+x); multiplying that by log10(e) gives an approximation to log10(1+x), with x between 0 and 1.

Code:


ln(1+x) = (1+(2.45432048794495419+(2.17440739254242255+(.841813647259988564+(.135819954481562186+(0.687077985071464530e-2+0.231261455529019880e-4*x)*x)*x)*x)*x)*x)*x/(1+(2.95432048794495173+(3.31823430318176310+(1.76615730286299464+(.451400626940937546+(0.492131362215352252e-1+0.158489557252300951e-2*x)*x)*x)*x)*x)*x)

log10(1+x) = (.434294481903251828+(1.06589784473659010+(.944333131990812123+(.365595021795863521+(0.589858567636932956e-1+(0.298394177553741882e-2+0.100435574013167601e-4*x)*x)*x)*x)*x)*x)*x/(1+(2.95432048794495173+(3.31823430318176310+(1.76615730286299464+(.451400626940937546+(0.492131362215352252e-1+0.158489557252300951e-2*x)*x)*x)*x)*x)*x)
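
To try the two formulas above numerically, here is a minimal Python sketch that evaluates them with Horner's scheme and compares them against `math.log1p` and `math.log10` on a grid of points in [0, 1]. The coefficients are copied from the post; the function and list names are made up for the illustration, and the measured error depends on double-precision evaluation rather than on Maple's arithmetic.

```python
import math

# Coefficients copied from the post, lowest power of x first.
NUM_LN = [1.0, 2.45432048794495419, 2.17440739254242255,
          0.841813647259988564, 0.135819954481562186,
          0.687077985071464530e-2, 0.231261455529019880e-4]
NUM_LOG10 = [0.434294481903251828, 1.06589784473659010,
             0.944333131990812123, 0.365595021795863521,
             0.589858567636932956e-1, 0.298394177553741882e-2,
             0.100435574013167601e-4]
DEN = [1.0, 2.95432048794495173, 3.31823430318176310,
       1.76615730286299464, 0.451400626940937546,
       0.492131362215352252e-1, 0.158489557252300951e-2]

def horner(coeffs, x):
    """Evaluate c0 + c1*x + c2*x**2 + ... using Horner's scheme."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def ln1p_rational(x):
    return horner(NUM_LN, x) * x / horner(DEN, x)

def log10_1p_rational(x):
    return horner(NUM_LOG10, x) * x / horner(DEN, x)

grid = [i / 1000 for i in range(1001)]
worst_ln = max(abs(ln1p_rational(x) - math.log1p(x)) for x in grid)
worst_log10 = max(abs(log10_1p_rational(x) - math.log10(1.0 + x)) for x in grid)
print("max |error| ln(1+x)   :", worst_ln)
print("max |error| log10(1+x):", worst_log10)
```

Both formulas share the same denominator, so an assembly version could compute the denominator once and needs only a single division per call.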

Title: Re: Fast Log10 approximation
Post by: guga on August 17, 2020, 04:54:55 AM

Quote from: jack on August 17, 2020, 03:18:10 AM
From what I gather, you use tables to calculate the logarithm. Just for fun, using Maple I computed a rational approximation to ln(1+x); multiplying that by log10(e) gives an approximation to log10(1+x), with x between 0 and 1.
Code:
ln(1+x) = (1+(2.45432048794495419+(2.17440739254242255+(.841813647259988564+(.135819954481562186+(0.687077985071464530e-2+0.231261455529019880e-4*x)*x)*x)*x)*x)*x)*x/(1+(2.95432048794495173+(3.31823430318176310+(1.76615730286299464+(.451400626940937546+(0.492131362215352252e-1+0.158489557252300951e-2*x)*x)*x)*x)*x)*x)
log10(1+x) = (.434294481903251828+(1.06589784473659010+(.944333131990812123+(.365595021795863521+(0.589858567636932956e-1+(0.298394177553741882e-2+0.100435574013167601e-4*x)*x)*x)*x)*x)*x)*x/(1+(2.95432048794495173+(3.31823430318176310+(1.76615730286299464+(.451400626940937546+(0.492131362215352252e-1+0.158489557252300951e-2*x)*x)*x)*x)*x)*x)

Great, jack. Thanks a lot.

Some questions:

1 - What is Maple? (Where can I download it?)
2 - Did you test the accuracy? What is the precision of the values you input, and can it handle denormalized values too?
3 - Can Maple produce these polynomial values without division? Division is slow.

Title: Re: Fast Log10 approximation
Post by: Siekmanski on August 17, 2020, 05:45:59 AM

Hi Guga,

You can always avoid divisions when the divisor is a constant.
Take the reciprocal of the constant and multiply instead of dividing.

Value real4 256.0
recValue real4 0.00390625 ; 1/256

divss xmm0,real4 ptr Value
mulss xmm0,real4 ptr recValue

Both give the same result.
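A small sketch of when the two forms agree exactly, assuming Python floats (REAL8 rather than the REAL4 used above). For a power-of-two constant such as 256 the reciprocal is exactly representable, so multiplying and dividing round identically; for a constant like 10 the reciprocal is itself rounded and the last bit can differ. The function name is illustrative.

```python
def count_mismatches(constant, samples=1000):
    """Count i in 1..samples where i / constant != i * (1 / constant)."""
    reciprocal = 1.0 / constant
    return sum(1 for i in range(1, samples + 1)
               if i / constant != i * reciprocal)


if __name__ == "__main__":
    print("mismatches for 256.0:", count_mismatches(256.0))
    print("mismatches for 10.0: ", count_mismatches(10.0))
    print(3 / 10.0, 3 * (1.0 / 10.0))
```

When the last bit matters, the reciprocal trick is safest with power-of-two constants; otherwise it trades a possible one-bit difference for speed.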

Title: Re: Fast Log10 approximation
Post by: jack on August 17, 2020, 05:53:26 AM

1- Maple is a Computer Algebra System: https://www.maplesoft.com/ns/maple/cas/computer-algebra-systems-math-education.aspx
2- The input range for x is between 0 and 1. The error is probably around +/-1e-16, depending on the precision used in the evaluation. As for denormalized values: this is just a rational polynomial approximation, independent of how the floating-point number is encoded.
3- Yes, but a rational polynomial usually needs fewer terms than a plain polynomial to reach the same precision; a plain polynomial would need degree 19 or 20 to match it.

Title: Re: Fast Log10 approximation
Post by: mineiro on August 17, 2020, 06:13:45 AM

Quote from: Siekmanski on August 17, 2020, 05:45:59 AM
Make a reciprocal of the constant and you can multiply instead of divide.

Multiplication is more efficient than division :thumbsup:

Integer division using reciprocals -- Robert Alverson
https://www.computer.org/csdl/proceedings-article/arith/1991/00145558/12OmNyaXPS1

Title: Re: Fast Log10 approximation
Post by: jj2007 on August 17, 2020, 10:20:09 AM

Quote from: HSE on August 16, 2020, 07:47:28 AM
:thumbsup:

Code Select

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

5396    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11737   cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3163    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
5060    cycles for 100 * Log10

3211    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4933    cycles for 100 * Log10

3209    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4823    cycles for 100 * Log10

3186    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4773    cycles for 100 * Log10

3219    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4739    cycles for 100 * Log10

536     bytes for Sse2_log10_precise (Guga SSE2 Log10 precise )
16      bytes for Log10

0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for Log10
0.6989700043360188 expected

Title: Re: Fast Log10 approximation
Post by: HSE on August 17, 2020, 10:36:07 AM

With Guga's latest correction:

Code Select

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

5257    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11751   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)
11687   cycles for 100 * JJ Log10

5251    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11741   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)
11637   cycles for 100 * JJ Log10

5254    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11758   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)
11664   cycles for 100 * JJ Log10

5248    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11760   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)
11641   cycles for 100 * JJ Log10

5274    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11735   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)
11653   cycles for 100 * JJ Log10

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for SmplMath:  fSlv MyReal8 = log(MyExpo)
0.698970004336019 for JJ Log10

Title: Re: Fast Log10 approximation
Post by: jj2007 on August 17, 2020, 10:54:07 AM

MB Log10 under the hood:

Code Select

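fldlg2           ; ST(0) = log10(2)
fld MyExpo       ; ST(0) = x, ST(1) = log10(2)
fyl2x            ; ST(0) = ST(1) * log2(ST(0)) = log10(x), pops once
fstp MyReal8     ; store the result and pop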
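The four FPU instructions above rely on the identity log10(x) = log10(2) * log2(x). A minimal Python sketch of the same identity, using `math.log2` to stand in for the log2 part of fyl2x (the function name is illustrative, and Python works in REAL8 rather than the FPU's REAL10):

```python
import math


def log10_via_fyl2x(x):
    """Mimic fldlg2 / fld x / fyl2x: y * log2(x) with y = log10(2)."""
    y = math.log10(2.0)        # what fldlg2 loads
    return y * math.log2(x)    # what fyl2x computes: ST(1) * log2(ST(0))


if __name__ == "__main__":
    print(log10_via_fyl2x(5.0), math.log10(5.0))
```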

Title: Re: Fast Log10 approximation
Post by: Mikl__ on August 17, 2020, 01:00:08 PM

[delete]

Title: Re: Fast Log10 approximation
Post by: guga on August 17, 2020, 05:14:44 PM

Hi Marinus

With the equation provided by Jack, we unfortunately can't use a reciprocal: the divisor is another polynomial, not a constant.

Jack, I tried Log10(5+1) with the formula, and the result is accurate to only about 5 digits after the ".". With x = 5, log10(x+1) = log10(6) evaluates as follows:

(0.434294481903251828+(1.0658978447365901+(0.944333131990812123+(0.365595021795863521+(0.589858567636932956e-1+(0.298394177553741882e-2+0.100435574013167601e-4*x)*x)*x)*x)*x)*x)*x, x=5
= 607.096994200488817

(1+(2.95432048794495173+(3.31823430318176310+(1.76615730286299464+(0.451400626940937546+(0.492131362215352252e-1+0.158489557252300951e-2*x)*x)*x)*x)*x)*x) , x=5
= 780.177558728198735

607.096994200488817/780.177558728198735 = 0.778152341615854664455814528055710104487791345037987518732

Expected log10(6) = 0.7781512503836436325087667979796083359683187456528044061402931014.

0.778152341615854664455814528055710104487791345037987518732 ; result using maple
0.7781512503836436325087667979796083359683187456528044061402931014 ; expected result

Could you check whether the precision can be extended to at least 14 digits after the ".", while keeping the same number of polynomial terms (or, better, fewer)? The numerator is a polynomial of the form x^7+x^6+... and the divisor one of the form x^6+x^5..., so it can be reformulated as:
(https://i.ibb.co/0mTnQ11/gfds-Image2.png) (https://imgbb.com/)

where A, B, C... are the values you posted and "x" is the input value. We could put the numerator coefficients in one matrix and the divisor coefficients in another, and then divide by multiplying with the inverse of the divisor matrix (the one with x^6+...), but computing the inverse matrix, and checking afterwards whether the division is possible, would also take a lot of processing time.

Please see whether the numbers you created with Maple can be extended to at least 14 digits of precision (after the "."), and also try simplifying the equation so we can check whether it's faster than the one I made.
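One possible reading of the result above: jack stated that the approximation holds for x between 0 and 1, and x = 5 lies outside that range. A hedged sketch of the usual workaround is to split the input into a power of two and a mantissa in [1, 2), so the rational part only ever sees x in [0, 1). `math.frexp` and the function names are choices made for this illustration; they do not come from the thread.

```python
import math

# log10(1+x) coefficients from jack's post, lowest power first.
LOG10_NUM = [0.434294481903251828, 1.06589784473659010, 0.944333131990812123,
             0.365595021795863521, 0.589858567636932956e-1,
             0.298394177553741882e-2, 0.100435574013167601e-4]
DEN = [1.0, 2.95432048794495173, 3.31823430318176310, 1.76615730286299464,
       0.451400626940937546, 0.492131362215352252e-1, 0.158489557252300951e-2]


def horner(coeffs, x):
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def log10_1p_rational(x):
    """jack's formula; intended for 0 <= x <= 1."""
    return x * horner(LOG10_NUM, x) / horner(DEN, x)


def log10_reduced(v):
    """log10(v) for v > 0 via v = m * 2**e with 1 <= m < 2."""
    if v <= 0.0:
        raise ValueError("v must be positive")
    m, e = math.frexp(v)       # v = m * 2**e with 0.5 <= m < 1
    m, e = m * 2.0, e - 1      # rescale so that 1 <= m < 2
    return e * math.log10(2.0) + log10_1p_rational(m - 1.0)


if __name__ == "__main__":
    print("direct, x = 5 :", log10_1p_rational(5.0))
    print("range-reduced :", log10_reduced(6.0))
    print("math.log10(6) :", math.log10(6.0))
```

For 6 this evaluates the rational part at x = 0.5 (since 6 = 1.5 * 2^2) and adds 2 * log10(2).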

Title: Re: Fast Log10 approximation
Post by: guga on August 17, 2020, 05:20:27 PM

Quote from: jj2007 on August 17, 2020, 10:20:09 AM
Quote from: HSE on August 16, 2020, 07:47:28 AM
:thumbsup:

Code Select

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

5396    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11737   cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3163    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
5060    cycles for 100 * Log10

3211    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4933    cycles for 100 * Log10

3209    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4823    cycles for 100 * Log10

3186    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4773    cycles for 100 * Log10

3219    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4739    cycles for 100 * Log10

536     bytes for Sse2_log10_precise (Guga SSE2 Log10 precise )
16      bytes for Log10

0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for Log10
0.6989700043360188 expected

Hi JJ.
I think you used the older version. The new one produces the correct value:

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)

The error is in the last 2 digits only :) Your version prints fewer digits than mine. In RosAsm, the full number (18 digits) is 0.698970004336018857, and only the last 2 digits lose precision.

Please try with the FastLog10a.zip I posted a few comments earlier :)

I managed to optimize a bit further, but I'm still working on the math to optimize it even more and avoid using so many registers. I now need to optimize this part (the modification below also produces the correct result and gained extra speed):

Code Select


(...)
     ; same as SSE_Place_Log0
    unpcklpd XMM2 XMM2 | mulpd XMM2 X$SSE_Log1020 | addpd XMM2 XMM3 | addpd XMM4 XMM2

    movupd XMM1 X$Log_Coeff0
    movupd XMM2 XMM0 | mulpd XMM2 XMM2 | mulsd XMM2 XMM2 | mulsd XMM2 XMM0
    mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1

    mulpd XMM1 XMM0
    addpd XMM1 X$SSE_Log10Var2 | mulpd XMM1 XMM2

    movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0

    ;--------------------
    ; All necessary data are stored in both packed doubles of registers xmm4, xmm1 and xmm2. Modified original version; it is faster with the shuffle.
    ; We only need to sum all of them
    addpd xmm1 xmm4 ; add both doubles of xmm4 to xmm1. xmm1 = xmm1+xmm4
    addpd xmm1 xmm2 ; add both doubles of xmm2 to the result above. xmm1 = xmm1+xmm4+xmm2
    movupd xmm0 xmm1 ; Now we need to sum the two doubles held in xmm1. To do that, we copy xmm1 to xmm0
    pshufd XMM0 XMM0 SSE_SWAP_QWORDS ; and swap its pair of doubles, so xmm0 holds the doubles of xmm1 in inverted order. SSE_SWAP_QWORDS = 78
    addpd xmm0 xmm1 ; Finally we add them up: both doubles of xmm0 now hold xmm1.Lo + xmm1.Hi

The math behind this is the formula below ("Lo" and "Hi" are the low and high doubles of each register or variable):

Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)

Final xmm0 = Hiquadxmm0 + Loquadxmm0
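The final "Hi + Lo" sum depends on the pshufd immediate 78 (binary 01001110) swapping the two 64-bit halves of the register. A small Python model of that step; the list-of-lanes representation and the function names are illustrative only.

```python
SSE_SWAP_QWORDS = 78  # 0b01001110, the equate named in the listing


def pshufd(src, imm8):
    """Model PSHUFD: src holds four 32-bit lanes, lane 0 being the lowest.

    Result lane i is taken from the source lane selected by bits
    2*i+1 .. 2*i of the immediate.
    """
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


def swap_and_add(lo, hi):
    """Model movupd / pshufd 78 / addpd on a register holding (lo, hi)."""
    copy = (hi, lo)                       # qwords swapped by pshufd
    return (lo + copy[0], hi + copy[1])   # both lanes end up as lo + hi


if __name__ == "__main__":
    print(pshufd(["lo.d0", "lo.d1", "hi.d0", "hi.d1"], SSE_SWAP_QWORDS))
    print(swap_and_add(1.5, 2.25))
```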

I'm trying to recreate the above equations using fewer registers (only xmm0, xmm1 and xmm2) while keeping the speed and shortening the computation, but it is hard to find the proper way.

I tried the code below, but the result is incorrect. I'm missing something.

Code Select



;---------------------------- This is good for speed, but I'm missing some calculation, because it now produces an incorrect value.

    ;   1st step xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
    movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0 | addpd xmm2 xmm4
    ;   ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
    movupd XMM1 X$Log_Coeff0 | mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var2
    ;   since xmm0.Hi and xmm0.Lo are part of the same xmm0, we can simply multiply this to get
    ;   the Lo and Hi quads of xmm0. For the Lo quad we must then simply multiply 3 more times to get xmm0.Hi^5
    mulpd XMM1 XMM0 | mulpd XMM1 XMM0
    ;   OK, now we have the Hi part xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
    ;   To get the Lo part we do:
    mulsd xmm1 xmm0 | mulsd xmm1 xmm0 | mulsd xmm1 xmm0
    ;   Now we can finally add xmm2 to xmm1
    addpd xmm1 xmm2
    ;   And finally we exchange the data to add both hi and lo quads
    movupd xmm0 xmm1 ; Now we need to sum the two doubles held in xmm1. To do that, we copy xmm1 to xmm0
    pshufd XMM0 XMM0 SSE_SWAP_QWORDS ; and swap its pair of doubles, so xmm0 holds the doubles of xmm1 in inverted order
    addpd xmm0 xmm1 ; Finally we add them up: both doubles of xmm0 now hold xmm1.Lo + xmm1.Hi

Title: Re: Fast Log10 approximation
Post by: Siekmanski on August 17, 2020, 07:27:53 PM

Hi Guga,

You can create a reciprocal with these SSE instructions:
rcpps - Approximates the reciprocal of 4 packed floats.
rcpss - Approximates the reciprocal of a single float.

Title: Re: Fast Log10 approximation
Post by: guga on August 17, 2020, 08:04:02 PM

Hi Marinus.

Great :thumbsup: :thumbsup: :thumbsup: :thumbsup:

But how can it be applied to the previous code, so that we use fewer registers and make it shorter and faster? (My version does not use divisions.)

I'm trying to simplify this even further:

Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
Final xmm0 = Hiquadxmm0 + Loquadxmm0

But I haven't succeeded yet.

Title: Re: Fast Log10 approximation
Post by: Siekmanski on August 17, 2020, 08:20:41 PM

Just noticed you use real8.

Unfortunately, there are no rcppd and rcpsd instructions.

Title: Re: Fast Log10 approximation
Post by: guga on August 17, 2020, 08:50:53 PM

Ok, I think i got it working as expected.

I had to do this:

; The formula to retrieve xmm0 is as follows:
; Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
; Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
; Final xmm0 = Hiquadxmm0 + Loquadxmm0

; 1st step we calculate low and hi values of xmm4.Hi + (Log10Var3.Hi*xmm0.Hi) and xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0 | addpd xmm2 xmm4 ; Save the result in xmm2

; Now, what do both have in common? They share this:
; xmm0.Lo * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
; xmm0.Hi * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
; So we compute it only once and keep it in xmm1, which gives us both the low and high values

; ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
movupd XMM1 X$Log_Coeff0 | mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1 | mulpd xmm1 xmm0 | addpd XMM1 X$SSE_Log10Var2

; now we get the LoQuad xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
; and the hi quad xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
; at once. (Because they share the same values of xmm0.Lo^2)
mulpd xmm1 xmm0 | mulpd xmm1 xmm0; xmm0.Lo^2 is calculated and applied in both low and hi doubles of xmm1

; Let's now do the same for xmm0.Hi^5
; Since we already calculated xmm0.Hi^2, we only need to multiply 3 more times by xmm0 to get the value of Loquadxmm0
mulsd xmm1 xmm0 | mulsd xmm1 xmm0 | mulsd xmm1 xmm0

; Now we can finally add xmm2 to xmm1
addpd xmm1 xmm2

; And finally we exchange the data to add both hi and lo quads

movupd xmm0 xmm1 ; Now we need to sum the two doubles held in xmm1. To do that, we copy xmm1 to xmm0
pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; and swap its pair of doubles, so xmm0 holds the doubles of xmm1 in inverted order. SSE_SWAP_QWORDS is a simple equate whose value is 78, which makes pshufd perform the swap
addpd xmm0 xmm1 ; Finally we add them up: both doubles of xmm0 now hold xmm1.Lo + xmm1.Hi

Although I removed the extra registers from this part of the code, it still needs 17 instructions to calculate the final result in xmm0.

I wonder if this can be optimized even further in SSE2
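A rough way to check a rewrite like this one without an assembler is to model each register as a (low, high) pair and run both instruction sequences on the same inputs. The sketch below does that for the earlier listing (which builds the powers in xmm2) and for the 17-instruction version above. The coefficient values are made-up placeholders, not the real table data. Lane labels follow the instruction semantics (mulsd changes only the low double), which may differ from the Hi/Lo naming used in the formulas.

```python
import math


def mulpd(a, b):
    return (a[0] * b[0], a[1] * b[1])


def addpd(a, b):
    return (a[0] + b[0], a[1] + b[1])


def mulsd(a, b):
    return (a[0] * b[0], a[1])      # only the low double changes


def hsum(a):
    return addpd(a, (a[1], a[0]))   # movupd / pshufd 78 / addpd


def original_sequence(x, c0, v1, v2, v3, x4):
    p = mulpd(x, x)                  # xmm2 = x^2 in both lanes
    p = mulsd(p, p)                  # low lane: x^4
    p = mulsd(p, x)                  # low lane: x^5
    r = addpd(mulpd(c0, x), v1)      # c0*x + v1
    r = addpd(mulpd(r, x), v2)       # (c0*x + v1)*x + v2
    r = mulpd(r, p)                  # times x^5 (low) / x^2 (high)
    t = mulpd(v3, x)                 # xmm2 = v3*x
    r = addpd(addpd(r, x4), t)
    return hsum(r)


def reduced_sequence(x, c0, v1, v2, v3, x4):
    t = addpd(mulpd(v3, x), x4)      # xmm2 = v3*x + xmm4
    r = addpd(mulpd(addpd(mulpd(c0, x), v1), x), v2)
    r = mulpd(mulpd(r, x), x)        # times x^2 in both lanes
    r = mulsd(mulsd(mulsd(r, x), x), x)   # low lane: three more factors
    r = addpd(r, t)
    return hsum(r)


if __name__ == "__main__":
    # Placeholder inputs, chosen only to exercise both lanes.
    args = ((0.3, 0.3), (0.11, 0.21), (0.12, 0.22), (0.13, 0.23),
            (0.14, 0.24), (0.15, 0.25))
    print(original_sequence(*args))
    print(reduced_sequence(*args))
```

The two sequences multiply in a different order, so the results can differ in the last bits even though they are algebraically identical.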

Title: Re: Fast Log10 approximation
Post by: jj2007 on August 17, 2020, 09:17:56 PM

Quote from: guga on August 17, 2020, 05:20:27 PM
Hi JJ.
I think you used the older version. The new one produces the correct value:

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)

The error is on the last 2 digits only :) Your version have less, digits then mine. So, on RosAsm, the full number (18 digits) is: 0.698970004336018857 with only the last 2 digits that looses precision.

Please, try with FastLog10a.zip i posted a few comments earlier :)

The only changes necessary are useC=0 and in TestB, line 357:
SetFloat MyReal8=Log10(MyExpo)
instead of ExpXY(MyBaseB, MyExpo, MyReal8)

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3157    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4795    cycles for 100 * MasmBasic Log10

3166    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4830    cycles for 100 * MasmBasic Log10

3145    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4772    cycles for 100 * MasmBasic Log10

3212    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4767    cycles for 100 * MasmBasic Log10

3145    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4765    cycles for 100 * MasmBasic Log10

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for MasmBasic Log10

We are close :smiley:

As regards precision:

include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
SetGlobals MyReal10:REAL10
Init
PrintLine "0.698970004336018804786 (expected)"
SetFloat MyReal10=Log10(5.0)
Print Str$(MyReal10)
EndOfCode

Code Select

0.698970004336018804786 (expected)
0.6989700043360188048
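To put the digits quoted in this thread in perspective, the sketch below compares the 18-digit REAL8 value reported earlier (0.698970004336018857) with the expected value, using exact decimal arithmetic. It assumes the printed 18 digits faithfully reflect the stored REAL8, and it uses the fact that REAL8 values between 0.5 and 1 are spaced 2^-53 apart.

```python
from decimal import Decimal

EXPECTED = Decimal("0.698970004336018804786")      # "expected" value printed above
REAL8_PRINTED = Decimal("0.698970004336018857")    # 18-digit REAL8 result from the thread
REAL10_PRINTED = Decimal("0.6989700043360188048")  # MasmBasic REAL10 output above

# Neighbouring REAL8 values in [0.5, 1) differ by 2**-53, so a correctly
# rounded result is at most half of that away from the true value.
HALF_SPACING_REAL8 = Decimal(2) ** -54

real8_error = abs(REAL8_PRINTED - EXPECTED)
real10_error = abs(REAL10_PRINTED - EXPECTED)

if __name__ == "__main__":
    print("REAL8 error       :", real8_error)
    print("half REAL8 spacing:", HALF_SPACING_REAL8)
    print("REAL10 error      :", real10_error)
```

Under these assumptions the REAL8 error is below half the spacing, so the trailing digits that look "wrong" reflect the limits of the REAL8 format rather than the algorithm.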

Title: Re: Fast Log10 approximation
Post by: guga on August 17, 2020, 09:47:28 PM

Quote from: jj2007 on August 17, 2020, 09:17:56 PM
Quote from: guga on August 17, 2020, 05:20:27 PM
Hi JJ.
I think you used the older version. The new one produces the correct value:

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)

The error is on the last 2 digits only :) Your version have less, digits then mine. So, on RosAsm, the full number (18 digits) is: 0.698970004336018857 with only the last 2 digits that looses precision.

Please, try with FastLog10a.zip i posted a few comments earlier :)

The only changes necessary are useC=0 and in TestB, line 357:
SetFloat MyReal8=Log10(MyExpo)
instead of ExpXY(MyBaseB, MyExpo, MyReal8)

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3157    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4795    cycles for 100 * MasmBasic Log10

3166    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4830    cycles for 100 * MasmBasic Log10

3145    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4772    cycles for 100 * MasmBasic Log10

3212    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4767    cycles for 100 * MasmBasic Log10

3145    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4765    cycles for 100 * MasmBasic Log10

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for MasmBasic Log10

We are close :smiley:

As regards precision:

include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
SetGlobals MyReal10:REAL10
Init
PrintLine "0.698970004336018804786 (expected)"
SetFloat MyReal10=Log10(5.0)
Print Str$(MyReal10)
EndOfCode

Code Select

0.698970004336018804786 (expected)
0.6989700043360188048

Great! Thanks, JJ. I'll change the code and compare it to yours. I'm trying to see if I can optimize a bit further before posting a newer version :)

Title: Re: Fast Log10 approximation
Post by: guga on August 19, 2020, 12:46:13 AM

New version.

A bit faster (around 10%). :thumbsup:

I think I've reached my limit of optimization :bgrin: :bgrin:. If someone wants to try optimizing it further, please do :greensml: :greensml: :greensml:

Accuracy not affected by the current optimization.

Precision is:
16 digits after the "." for normalized values
12 digits after the "." for denormalized values

My timings:

Code Select


AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

2640    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6699    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2507    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6765    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2471    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6774    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2516    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6785    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2569    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6600    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for Log10 (MasmBasic, Log10 of 5)

--- ok ---

Updated version also on the 1st post

Title: Re: Fast Log10 approximation
Post by: jj2007 on August 19, 2020, 01:06:03 AM

:thumbsup:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2867    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4731    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2860    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4780    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2834    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4775    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2878    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4760    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2858    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4742    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for Log10 (MasmBasic, Log10 of 5)

Title: Re: Fast Log10 approximation
Post by: guga on August 19, 2020, 02:04:22 AM

Great !

Small improvement.
Replace the cmp ecx, 1023 with a test ecx, (not 1023), like this:

Code Select

		test ecx, 0FFFFFC00h
		 jz     loc_4C6BFC
;                 cmp     ecx, 1023; Check for special cases (NAN, INF etc). Otherwise, jmp to the start of computation
;                 jbe     loc_4C6BFC

It should give another 2 or 3% speedup :bgrin:
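The replacement works because an unsigned 32-bit value is at most 1023 exactly when none of its bits above bit 9 are set. A small Python check of that equivalence (the function names are illustrative; Python integers stand in for the 32-bit ecx):

```python
MASK = 0xFFFFFC00  # "not 1023" as a 32-bit value


def branch_taken_cmp(ecx):
    """cmp ecx, 1023 / jbe: taken when ecx <= 1023 (unsigned)."""
    return ecx <= 1023


def branch_taken_test(ecx):
    """test ecx, 0FFFFFC00h / jz: taken when no bit above bit 9 is set."""
    return (ecx & MASK) == 0


if __name__ == "__main__":
    samples = list(range(0, 4096)) + [0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
    print(all(branch_taken_cmp(v) == branch_taken_test(v) for v in samples))
```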

Btw, take a look at the source. I commented it a little and also changed the tables Log10_Table_T and SSE_Log1020 to hold the proper values. Now the 2nd Real8 in both tables can be zeroed, and a later improvement could use an array of 2 Real8 instead of 4 Real8 values in Log10_Table_T and Log10_Table_B, and perhaps mulsd instead of mulpd in some parts of the code. I haven't checked this part yet, but I guess it is fast and precise enough already :)

The Log10_Table_T table is formed by log10(x), where x ranges from 1 to 2.

I haven't found out yet how those values were generated, but the table is computed as follows:

Code Select


; Log10_Table_T is formed by a log10(x) where x is the same as below:
; dq log10(1), log10(1), log10(1.01587301587301580), log10(1)
; dq log10(1.03121852970795570), log10(1),  log10(1.04703476482617600), log10(1)
; dq log10(1.06224066390041490), log10(1),  log10(1.07789473684210520), log10(1)
; dq log10(1.09401709401709410), log10(1),  log10(1.10942578548212340), log10(1)
; dq log10(1.12527472527472530), log10(1),  log10(1.14031180400890860), log10(1)
; dq log10(1.15640880858272150), log10(1),  log10(1.17162471395881010), log10(1)
; dq log10(1.18724637681159420), log10(1),  log10(1.20329024676850760), log10(1)
; dq log10(1.21904761904761890), log10(1),  log10(1.23447860156720910), log10(1)
; dq log10(1.25030525030525030), log10(1),  log10(1.26576019777503080), log10(1)
; dq log10(1.28160200250312890), log10(1),  log10(1.29702343255224830), log10(1)
; dq log10(1.31282051282051280), log10(1),  log10(1.32814526588845650), log10(1)
; dq log10(1.34383202099737530), log10(1),  log10(1.35899137358991370), log10(1)
; dq log10(1.37541974479516460), log10(1),  log10(1.39035980991174470), log10(1)
; dq log10(1.40659340659340670), log10(1),  log10(1.42222222222222210), log10(1)
; dq log10(1.43719298245614050), log10(1),  log10(1.45351312987934710), log10(1)
; dq log10(1.46915351506456250), log10(1),  log10(1.48405797101449280), log10(1)
; dq log10(1.50036630036630040), log10(1),  log10(1.51591413767579560), log10(1)
; dq log10(1.53178758414360510), log10(1),  log10(1.54682779456193350), log10(1)
; dq log10(1.56216628527841350), log10(1),  log10(1.57781201848998460), log10(1)
; dq log10(1.59377431906614800), log10(1),  log10(1.60879811468970920), log10(1)
; dq log10(1.62539682539682540), log10(1),  log10(1.64102564102564120), log10(1)
; dq log10(1.65561843168957170), log10(1),  log10(1.67183673469387760), log10(1)
; dq log10(1.68698517298187810), log10(1),  log10(1.70382695507487520), log10(1)
; dq log10(1.71812080536912770), log10(1),  log10(1.73412362404741740), log10(1)
; dq log10(1.75042735042735040), log10(1),  log10(1.76551724137931030), log10(1)
; dq log10(1.78086956521739140), log10(1),  log10(1.79649122807017550), log10(1)
; dq log10(1.81238938053097340), log10(1),  log10(1.82857142857142850), log10(1)
; dq log10(1.84338433843384330), log10(1),  log10(1.86012715712988190), log10(1)
; dq log10(1.87545787545787520), log10(1),  log10(1.89104339796860570), log10(1)
; dq log10(1.90689013035381750), log10(1),  log10(1.92120075046904320), log10(1)
; dq log10(1.93755912961210970), log10(1),  log10(1.95233555767397520), log10(1)
; dq log10(1.96923076923076930), log10(1),  log10(1.98449612403100770), log10(1)
; dq log10(2), log10(1)

But how the values 1.01587301587301580, 1.87545787545787520, 1.98449612403100770 etc. were generated, I have no idea yet.
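An observation that is not confirmed anywhere in this thread: the arguments checked below match 2048/k, where k is 2048/(1 + i/64) rounded to the nearest integer and i is the entry index from 0 to 64. That would make each argument a slightly adjusted 1 + i/64 whose reciprocal k/2048 fits in a few bits, a pattern often seen in table-driven logarithms. Whether the table was really built this way is an assumption. Only a handful of entries are compared here, and the function name is hypothetical.

```python
import math


def table_argument(i):
    """Candidate rule for the i-th log10 argument, i = 0..64 (assumption)."""
    k = round(131072 / (64 + i))   # 131072/(64+i) equals 2048/(1 + i/64)
    return 2048 / k


# A few arguments copied from the table listing above, keyed by entry index.
LISTED = {
    0: 1.0,
    1: 1.01587301587301580,
    2: 1.03121852970795570,
    24: 1.37541974479516460,
    56: 1.87545787545787520,
    63: 1.98449612403100770,
    64: 2.0,
}

if __name__ == "__main__":
    for i, listed in LISTED.items():
        print(i, listed, table_argument(i),
              math.isclose(table_argument(i), listed, rel_tol=1e-14))
```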

Title: Re: Fast Log10 approximation
Post by: daydreamer on August 19, 2020, 09:51:59 AM

What about checking CPUID for SSE3 and providing a version that uses the horizontal add HADDPD instead of
movupd | shufps addps
Would that be a few cycles faster?

Title: Re: Fast Log10 approximation
Post by: guga on August 19, 2020, 11:34:08 AM

Quote from: daydreamer on August 19, 2020, 09:51:59 AM
what about you check cpuid for SSE3 is on cpu and have a version with horizontal add,HADDPD instead of
movupd | shufps addps
would that be few cycles faster?

Hi DayDreamer. It would probably be faster, but I have no way to assemble SSE3 or SSE4 code: I haven't had time to update RosAsm to support them yet. That's why I'm using SSE2 and trying to get the maximum speed out of it (it could also help others who don't have newer processors).

Title: Re: Fast Log10 approximation
Post by: daydreamer on August 20, 2020, 04:47:00 AM

Quote from: guga on August 19, 2020, 11:34:08 AM
Quote from: daydreamer on August 19, 2020, 09:51:59 AM
what about you check cpuid for SSE3 is on cpu and have a version with horizontal add,HADDPD instead of
movupd | shufps addps
would that be few cycles faster?

Hi DayDreamer. probably it would be faster, but i don´t have how to assemble in SSE3 or SSE4. I didn´t have time to updated RosAsm to work with those yet. That´s why i´m using SSE2 trying to get the maximum speed of it (also could help others that don´t have newer processors)

How are the macro capabilities of RosAsm? Is it possible to write SSE3 and SSE4 macros?
http://www.masmforum.com/board/index.php?topic=973.0

Title: Re: Fast Log10 approximation
Post by: guga on August 20, 2020, 07:38:43 AM

Quote from: daydreamer on August 20, 2020, 04:47:00 AM
Quote from: guga on August 19, 2020, 11:34:08 AM
Quote from: daydreamer on August 19, 2020, 09:51:59 AM
what about you check cpuid for SSE3 is on cpu and have a version with horizontal add,HADDPD instead of
movupd | shufps addps
would that be few cycles faster?

Hi DayDreamer. probably it would be faster, but i don´t have how to assemble in SSE3 or SSE4. I didn´t have time to updated RosAsm to work with those yet. That´s why i´m using SSE2 trying to get the maximum speed of it (also could help others that don´t have newer processors)
hows the macro caps for rosasm?possible to make SSE3 and SSE4 macros?
http://www.masmforum.com/board/index.php?topic=973.0 (http://www.masmforum.com/board/index.php?topic=973.0)

You mean using something like this ?

Code Select


I'm having a little trouble including these macros.

This one doesn't look right:
Code:
ORPD MACRO M1,M2
    DB 066H
    ORPS MACRO M1,M2
ENDM

After fixing that, it hates these two:

Code:
CMPLTSD MACRO M1,M2
    DB 0F2H
    CMPLTPS M1,M2
END
and
Code:
CMPSD MACRO M1,M2,M3
    DB 0F2H
    CMPPS M1,M2,M3
ENDM

Never tried creating a pseudo-instruction as a macro by hand before.

In RosAsm we could hard code it with something like: DB 01 025 070

The problem is that it would be extremely hard to follow. It would be better to implement SSE3 and SSE4 in RosAsm, which I have to do eventually. I just can't do it right now, due to lack of time and several things in RosAsm that need fixing first. I still need to find time to split RosAsm's internal code into DLLs for the main tools, such as the disassembler, debugger, resource editor, form creator and even the encoder. All of these would be better off in their own DLLs than in a single monolithic source, as they are now.

Macros in RosAsm work inside "[" and "]". The 1st bracket must be immediately followed by a separator "|". Like this:

[HIWORD | mov eax #1 | shr eax 16]

or the normal If Chain.

[If | cmp #1 #3 | jn#2 I1>]
[Else_if | jmp I9> | I1: | cmp #1 #3 | jn#2 I1>]
[Else | Jmp I9> | I1:]
[End_if | I1: | I9:]

Of course, this is not a rigid syntax. It's just the default macro set; the user can choose whether to use it, or even write his own macro set.

Title: Re: Fast Log10 approximation
Post by: Siekmanski on August 20, 2020, 08:13:58 AM

If you need a fast SSE2 horizontal addition for 4 packed floats,

Code Select

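    movaps  xmm1,xmm0               ; xmm1 = a0, a1, a2, a3 (copy of xmm0)
    shufps  xmm1,xmm0,10110001b     ; xmm1 = a1, a0, a3, a2
    addps   xmm0,xmm1               ; xmm0 = a0+a1, a1+a0, a2+a3, a3+a2
    movhlps xmm1,xmm0               ; low half of xmm1 = high half of xmm0
    addss   xmm0,xmm1               ; lowest float of xmm0 = a0+a1+a2+a3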
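The five instructions above can be traced lane by lane with a small Python model. Lists stand in for the four floats of a register, with index 0 as the lowest lane. The helper names are illustrative, and Python floats are REAL8 rather than the REAL4 lanes of the real instructions.

```python
def shufps(dst, src, imm8):
    """Model SHUFPS: two low result lanes come from dst, two high from src."""
    return [dst[imm8 & 3], dst[(imm8 >> 2) & 3],
            src[(imm8 >> 4) & 3], src[(imm8 >> 6) & 3]]


def horizontal_add_ps(values):
    """Trace the movaps/shufps/addps/movhlps/addss sequence."""
    xmm0 = list(values)
    xmm1 = list(xmm0)                              # movaps  xmm1, xmm0
    xmm1 = shufps(xmm1, xmm0, 0b10110001)          # [a1, a0, a3, a2]
    xmm0 = [p + q for p, q in zip(xmm0, xmm1)]     # addps   xmm0, xmm1
    xmm1 = [xmm0[2], xmm0[3], xmm1[2], xmm1[3]]    # movhlps xmm1, xmm0
    xmm0[0] = xmm0[0] + xmm1[0]                    # addss   xmm0, xmm1
    return xmm0


if __name__ == "__main__":
    print(horizontal_add_ps([1.0, 2.0, 3.0, 4.0]))
```

Only the lowest lane holds the full sum at the end; the other three lanes keep partial sums.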

Title: Re: Fast Log10 approximation
Post by: guga on August 20, 2020, 08:28:46 AM

Quote from: Siekmanski on August 20, 2020, 08:13:58 AM
If you need a fast SSE2 horizontal addition for 4 packed floats,

Code Select

    movaps  xmm1,xmm0
    shufps  xmm1,xmm0,10110001b
    addps   xmm0,xmm1
    movhlps xmm1,xmm0
    addss   xmm0,xmm1

Great. Many thanks, Marinus.

This makes it easy to create a very useful macro that simulates the haddps opcode :thumbsup: :thumbsup: :thumbsup: :thumbsup: :thumbsup:

Title: Re: Fast Log10 approximation
Post by: daydreamer on August 20, 2020, 11:02:46 PM

Quote from: guga on August 20, 2020, 08:28:46 AM
Quote from: Siekmanski on August 20, 2020, 08:13:58 AM
If you need a fast SSE2 horizontal addition for 4 packed floats,

Code Select

    movaps  xmm1,xmm0
    shufps  xmm1,xmm0,10110001b
    addps   xmm0,xmm1
    movhlps xmm1,xmm0
    addss   xmm0,xmm1

Great. Many thanks marinus.

This could be easy and very usefull to create a macro to simulate the haddps opcodes :thumbsup: :thumbsup: :thumbsup: :thumbsup: :thumbsup:

:thumbsup: :thumbsup:
To replace the mnemonic HADDPS with a macro of the same name in MASM, just use NOKEYWORD at the very beginning of the source file, listing the mnemonics you want to redefine as macros.
And now I'll probably get flamed by bare-metal coders for this heresy :bgrin:

Title: Re: Fast Log10 approximation
Post by: guga on August 21, 2020, 01:18:40 AM

Hi Daydreamer :bgrin: :bgrin: :bgrin:

I'll probably do this for RosAsm. A set of SSE2 macros that simulate the behaviour of SSE3 and SSE4 instructions is needed for both Masm and RosAsm, especially when we want code that adapts to the user's processor (or, as in my case, when the SSE3/SSE4 opcodes aren't built in yet).

These could be very handy :azn: :azn: :azn:

Title: Re: Fast Log10 approximation
Post by: jj2007 on September 02, 2020, 10:04:11 AM

Hi Guga & friends,

Can I have some timings, please? This is work in progress, building requires MasmBasic version 2 September 2020 (http://masm32.com/board/index.php?topic=94.0). The core is here:

FastMath FastLog10 ; define a math function
For_ fct=0.0 To 10.0 Step 0.5
fld fct ; X
fstp REAL10 ptr [edi]
void Log10(fct) ; Y (built-in MasmBasic function)
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
FastMath ; -------- done -------------

Usage in the attached source (*.asc opens in RichMasm, WordPad, MS Word):

Code Select

TestC proc
  mov ebx, AlgoLoops-1	; loop e.g. 100x
  align 4
  .Repeat
	void FastLog10(MyExpo)   ; <<<<<<< put result in ST(0)
	fstp res8
	dec ebx
  .Until Sign?
  ret
TestC endp

Timings:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2244 µs for initialising FastLog10

3170    cycles for 100 * Log10 (Guga)
4783    cycles for 100 * MasmBasic Log10
1553    cycles for 100 * FastMath Log10

3148    cycles for 100 * Log10 (Guga)
4795    cycles for 100 * MasmBasic Log10
1554    cycles for 100 * FastMath Log10

3173    cycles for 100 * Log10 (Guga)
4788    cycles for 100 * MasmBasic Log10
1552    cycles for 100 * FastMath Log10

3319    cycles for 100 * Log10 (Guga)
4781    cycles for 100 * MasmBasic Log10
1552    cycles for 100 * FastMath Log10

3175    cycles for 100 * Log10 (Guga)
4786    cycles for 100 * MasmBasic Log10
1540    cycles for 100 * FastMath Log10

536     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
162     bytes for FastMath Log10

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575

Title: Re: Fast Log10 approximation
Post by: Siekmanski on September 02, 2020, 03:40:53 PM

Code Select

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
28 µs for initialising FastLog10

3470    cycles for 100 * Log10 (Guga)
5838    cycles for 100 * MasmBasic Log10
1912    cycles for 100 * FastMath Log10

3473    cycles for 100 * Log10 (Guga)
5838    cycles for 100 * MasmBasic Log10
1907    cycles for 100 * FastMath Log10

3484    cycles for 100 * Log10 (Guga)
5833    cycles for 100 * MasmBasic Log10
1916    cycles for 100 * FastMath Log10

3481    cycles for 100 * Log10 (Guga)
5834    cycles for 100 * MasmBasic Log10
1913    cycles for 100 * FastMath Log10

3477    cycles for 100 * Log10 (Guga)
5838    cycles for 100 * MasmBasic Log10
1913    cycles for 100 * FastMath Log10

536     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
162     bytes for FastMath Log10

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575

--- ok ---

Title: Re: Fast Log10 approximation
Post by: jj2007 on September 02, 2020, 10:40:45 PM

New version, I had forgotten to switch off the range checks:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
88 µs for initialising FastLog10

2882    cycles for 100 * Log10 (Guga)
4780    cycles for 100 * MasmBasic Log10
1399    cycles for 100 * FastMath Log10

2918    cycles for 100 * Log10 (Guga)
5372    cycles for 100 * MasmBasic Log10
1502    cycles for 100 * FastMath Log10

2912    cycles for 100 * Log10 (Guga)
4787    cycles for 100 * MasmBasic Log10
1397    cycles for 100 * FastMath Log10

2882    cycles for 100 * Log10 (Guga)
4774    cycles for 100 * MasmBasic Log10
1387    cycles for 100 * FastMath Log10

2881    cycles for 100 * Log10 (Guga)
4774    cycles for 100 * MasmBasic Log10
1392    cycles for 100 * FastMath Log10

516     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
67      bytes for FastMath Log10

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575

Guga's version is g from this post (http://masm32.com/board/index.php?topic=8744.msg95585#msg95585).

Title: Re: Fast Log10 approximation
Post by: Siekmanski on September 02, 2020, 11:10:13 PM

Code Select

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
78 µs for initialising FastLog10

3142    cycles for 100 * Log10 (Guga)
5851    cycles for 100 * MasmBasic Log10
1698    cycles for 100 * FastMath Log10

3144    cycles for 100 * Log10 (Guga)
5839    cycles for 100 * MasmBasic Log10
1694    cycles for 100 * FastMath Log10

3154    cycles for 100 * Log10 (Guga)
5834    cycles for 100 * MasmBasic Log10
1693    cycles for 100 * FastMath Log10

3142    cycles for 100 * Log10 (Guga)
5838    cycles for 100 * MasmBasic Log10
1695    cycles for 100 * FastMath Log10

3150    cycles for 100 * Log10 (Guga)
5836    cycles for 100 * MasmBasic Log10
1702    cycles for 100 * FastMath Log10

516     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
67      bytes for FastMath Log10

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575

--- ok ---

Title: Re: Fast Log10 approximation
Post by: jj2007 on September 02, 2020, 11:19:43 PM

Here is another one, with a FastSqrt added:

FastMath FastSqrt ; define a math function
For_ fct=0.0 To 10.0 Step 0.5
fld fct ; X
fld st
fstp REAL10 ptr [edi]
fsqrt ; Y
fstp REAL10 ptr [edi+REAL10]
add edi, 2*REAL10
Next
FastMath ; -------- done -------------

The speed gain is very modest, though:

Code Select

1859    cycles for 100 * fsqrt
1398    cycles for 100 * FastSqrt

The tangent is more impressive:

Code Select

10920   cycles for 100 * fptan
1390    cycles for 100 * FastTan

Title: Re: Fast Log10 approximation
Post by: TouEnMasm on September 02, 2020, 11:50:51 PM

Hello,
I tried to run your exe and got this answer:

Quote
This file content an indesirable Virus and will be deleted soon

and the file was deleted.
What did you compile your files with?

Title: Re: Fast Log10 approximation
Post by: jj2007 on September 03, 2020, 12:03:02 AM

With MASM (ML.exe), but it 'compiles' also with UAsm and AsmC. Quality AV products at VirusTotal say it's clean (https://www.virustotal.com/gui/file/328a31d9191bb1123660f0f722eaf5868690c58df925d818f15e48993f21d425/detection).

I suggest you move your crappy AV software into the recycle bin (http://masm32.com/board/index.php?board=23.0). Is it a French product? "This file content an indesirable Virus" should be contains :cool:

Title: Re: Fast Log10 approximation
Post by: TouEnMasm on September 03, 2020, 01:54:40 AM

Quote
I suggest you move your crappy AV software into the recycle bin.

You are surely speaking of Windows 10 Home edition, on a perfectly standard PC (4 GB RAM, 1 TB disk)

Title: Re: Fast Log10 approximation
Post by: Siekmanski on September 03, 2020, 07:46:58 AM

Code Select

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
44 µs for initialising FastLog10

3175    cycles for 100 * Log10 (Guga)
5845    cycles for 100 * MasmBasic Log10
1693    cycles for 100 * FastMath Log10
1545    cycles for 100 * fsqrt
1698    cycles for 100 * FastSqrt

3149    cycles for 100 * Log10 (Guga)
5838    cycles for 100 * MasmBasic Log10
1692    cycles for 100 * FastMath Log10
1545    cycles for 100 * fsqrt
1690    cycles for 100 * FastSqrt

3147    cycles for 100 * Log10 (Guga)
5836    cycles for 100 * MasmBasic Log10
1690    cycles for 100 * FastMath Log10
1549    cycles for 100 * fsqrt
1692    cycles for 100 * FastSqrt

3142    cycles for 100 * Log10 (Guga)
5840    cycles for 100 * MasmBasic Log10
1696    cycles for 100 * FastMath Log10
1542    cycles for 100 * fsqrt
1695    cycles for 100 * FastSqrt

3140    cycles for 100 * Log10 (Guga)
5841    cycles for 100 * MasmBasic Log10
1692    cycles for 100 * FastMath Log10
1544    cycles for 100 * fsqrt
1691    cycles for 100 * FastSqrt

516     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
67      bytes for FastMath Log10
14      bytes for fsqrt
67      bytes for FastSqrt

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575
Real8   2.236067977499789805
Real8   2.236067977499789805

--- ok ---

Title: Re: Fast Log10 approximation
Post by: six_L on September 03, 2020, 08:29:02 AM

Quote
Intel(R) Core(TM) i5-9400H CPU @ 2.50GHz (SSE4)
31 µs for initialising FastLog10

1622 cycles for 100 * Log10 (Guga)
3402 cycles for 100 * MasmBasic Log10
670 cycles for 100 * FastMath Log10
351 cycles for 100 * fsqrt
669 cycles for 100 * FastSqrt

1637 cycles for 100 * Log10 (Guga)
3474 cycles for 100 * MasmBasic Log10
669 cycles for 100 * FastMath Log10
328 cycles for 100 * fsqrt
667 cycles for 100 * FastSqrt

1583 cycles for 100 * Log10 (Guga)
3785 cycles for 100 * MasmBasic Log10
653 cycles for 100 * FastMath Log10
339 cycles for 100 * fsqrt
663 cycles for 100 * FastSqrt

1593 cycles for 100 * Log10 (Guga)
3320 cycles for 100 * MasmBasic Log10
653 cycles for 100 * FastMath Log10
341 cycles for 100 * fsqrt
663 cycles for 100 * FastSqrt

1590 cycles for 100 * Log10 (Guga)
3434 cycles for 100 * MasmBasic Log10
667 cycles for 100 * FastMath Log10
340 cycles for 100 * fsqrt
664 cycles for 100 * FastSqrt

516 bytes for Log10 (Guga)
16 bytes for MasmBasic Log10
67 bytes for FastMath Log10
14 bytes for fsqrt
67 bytes for FastSqrt

Real8 0.6989700043360187465
Real8 0.6989700043360188575
Real8 0.6989700043360188575
Real8 2.236067977499789805
Real8 2.236067977499789805

--- ok ---

Title: Re: Fast Log10 approximation
Post by: jj2007 on September 03, 2020, 09:28:45 AM

Siekmanski & six_L, your built-in fsqrt is faster than the police allowed :cool:

FastMath shines for more complex functions, see this post (http://masm32.com/board/index.php?topic=8779.0) where it replaces the GSL Bessel function (https://www.gnu.org/software/gsl/doc/html/usage.html). Speed gain is roughly a factor 55 on my trusty old Core i5 :tongue:

Title: Re: Fast Log10 approximation
Post by: TimoVJL on September 03, 2020, 03:29:02 PM

Code Select

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
36 µs for initialising FastLog10

2396    cycles for 100 * Log10 (Guga)
6316    cycles for 100 * MasmBasic Log10
1584    cycles for 100 * FastMath Log10
690     cycles for 100 * fsqrt
1518    cycles for 100 * FastSqrt

2418    cycles for 100 * Log10 (Guga)
6323    cycles for 100 * MasmBasic Log10
1542    cycles for 100 * FastMath Log10
699     cycles for 100 * fsqrt
1585    cycles for 100 * FastSqrt

2431    cycles for 100 * Log10 (Guga)
6362    cycles for 100 * MasmBasic Log10
1523    cycles for 100 * FastMath Log10
660     cycles for 100 * fsqrt
1598    cycles for 100 * FastSqrt

2552    cycles for 100 * Log10 (Guga)
6347    cycles for 100 * MasmBasic Log10
1523    cycles for 100 * FastMath Log10
660     cycles for 100 * fsqrt
1547    cycles for 100 * FastSqrt

2466    cycles for 100 * Log10 (Guga)
6344    cycles for 100 * MasmBasic Log10
1520    cycles for 100 * FastMath Log10
660     cycles for 100 * fsqrt
1518    cycles for 100 * FastSqrt

516     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
67      bytes for FastMath Log10
14      bytes for fsqrt
67      bytes for FastSqrt

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575
Real8   2.236067977499789805
Real8   2.236067977499789805

-

Title: Re: Fast Log10 approximation
Post by: jj2007 on September 03, 2020, 07:31:49 PM

Code Select

340     cycles for 100 * fsqrt	Intel(R) Core(TM) i5-9400H (six_L)
660     cycles for 100 * fsqrt	AMD Ryzen 5 3400G (Timo)
1545    cycles for 100 * fsqrt	Intel(R) Core(TM) i7-4930K (Siekmanski)
1859    cycles for 100 * fsqrt	Intel(R) Core(TM) i5-2450M (jj2007)

8 years of progress... but why did they pick fsqrt? The other timings are not so different :rolleyes:

Title: Re: Fast Log10 approximation
Post by: Vortex on September 03, 2020, 09:36:54 PM

Quote from: TouEnMasm on September 03, 2020, 01:54:40 AM

Quote
I suggest you move your crappy AV software into the recycle bin.
You are surely speaking of Windows 10 Home edition, on a perfectly standard PC (4 GB RAM, 1 TB disk)

Hi ToutEnMasm,

Could you try excluding the false positives in your antivirus settings?

Title: Re: Fast Log10 approximation
Post by: jj2007 on September 03, 2020, 11:49:52 PM

Any other candidates, i.e. slow math functions that deserve a speed boost?
