Fast Log10 approximation

guga · August 16, 2020, 06:01:01 AM

Hi Guys

Another test. This time it is a fast Log10 function, along the same lines as the fast exp we are testing in this thread: http://masm32.com/board/index.php?topic=8734.0

I succeeded in assembling the file with MasmBasic (there are small errors, but it is working as expected). The function calculates the Log10 of a number with a precision of 16 digits after the decimal point.

It also checks for NaN, zero, and negative and positive infinity, and it computes the result for denormalized values.

The resulting value is stored in xmm0. (This time I succeeded in making it use fewer xmm registers.)

Important: if you assemble the function on its own, it is mandatory to align the data on a 16-byte boundary, because the xmm instructions use 128-bit memory operands. (I used memory operands to minimize the number of xmm registers in use.)

Btw, please don't mind the labels in the console output. I don't know how to configure MasmBasic to output the proper value.

Reference value of Log10(5):
Log10(5) = 0.698970004336018804786261105275506973231810118537891458689
My version:
Log10(5) = 0.698970004336018857

On my PC, the average speed is around 2500 cycles (for 100 calls, as in the timings below), but again, we can make it work even faster with some extra effort. I can make it run in around 1600 cycles (without losing precision) simply by removing the check for NaNs, i.e. by commenting out these lines:

                cmp     ecx, 2045 ; Check for NAN, Infinite, Zero and Denormalized values
                jbe     loc_4C6BFC
(...)

loc_4C6BFC:

But if we do so, we can no longer check the input number for errors, nor calculate the log10 of denormalized values.
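Why a single compare against 2045 is enough can be shown with a small model. This is a sketch in Python under the assumption of IEEE-754 doubles; the function name `classify` and the returned strings are made up for illustration. As in the RosAsm listing, the sign bit and the 11 exponent bits are taken together, 1 is subtracted, and the result is compared as an unsigned number: only positive normal values land in the range 0..2045.

```python
import struct


def classify(x):
    """Hypothetical model of the special-case test in front of the fast path."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    top12 = bits >> 52                    # sign bit + biased exponent
    key = (top12 - 1) & 0xFFFFFFFF        # "sub ecx 1", seen as unsigned 32-bit
    if key <= 2045:
        return "normal"                   # fast path, no further checks needed
    if x <= 0.0:
        return "negative" if x < 0.0 else "zero"
    if key == 0xFFFFFFFF:                 # exponent field was 0: denormal
        return "denormal"
    return "nan" if x != x else "infinite"


for value in (5.0, 0.0, -1.0, 2e-314, float("inf"), float("nan")):
    print(value, classify(value))
```

Dropping the compare removes exactly these branches, which is where the saved cycles come from and why the error reporting is lost.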

The parameters are used the same way as in the exp function:

1st parameter = the value to be calculated. It can be an integer, a Float, or the address of a Real8 (or qword).
2nd parameter = the flag that describes the type of input. I built a set of equates for that:
SSE_EXP_INT equ 1
SSE_EXP_FLOAT equ 2
SSE_EXP_REAL8 equ 4
SSE_EXP_QWORD equ 8

So, if you want to calculate the log10 of 5 and the input is an integer, you must use

invoke Sse2_log10_precise, 5, 1
or
invoke Sse2_log10_precise, 5, SSE_EXP_INT

If the input is a Float (Real4) you use:

MyValue Real4 5.0 ; I believe this is the masm syntax for Float, right ?

invoke Sse2_log10_precise, [MyValue], 2

or
invoke Sse2_log10_precise, [MyValue], SSE_EXP_FLOAT

If the input is a Real8, you must pass the address (offset) of the value as the input, like this:

MyValue Real8 5.0
invoke Sse2_log10_precise, offset MyValue, 4

or

MyValue Real8 5.0
invoke Sse2_log10_precise, offset MyValue, SSE_EXP_REAL8

The same applies when calculating from a QWord (int64), but using 8 as the flag (2nd parameter) and the pointer to the int64 value as the 1st parameter.

Ex:
push 5 ; The value "5" we use to compute Real8 as i explained above
push offset MyExpo ; Pointer to Real8 or to a Qword only when we are using 64 bits values. For 32 Bits values (dword, Float etc),
; we use the value directly because int or Float are not pointers. (So, without the "offset" thing in masm)
call Sse2_log10_precise
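How the two parameters fit together can be modelled roughly like this. It is a Python sketch, not the routine itself: `load_input` is a hypothetical name, a `bytes` object stands in for the memory that the pointer refers to, and it assumes that the `Test_if` macro in the RosAsm listing below performs a bit test on the flag.

```python
import struct

SSE_EXP_INT, SSE_EXP_FLOAT, SSE_EXP_REAL8, SSE_EXP_QWORD = 1, 2, 4, 8


def load_input(number, flag):
    """Model of the argument decoding; returns a Python float, or None.

    number: a 32-bit value for INT/FLOAT, or 8 bytes of memory
            (standing in for the pointed-to value) for REAL8/QWORD.
    """
    if flag & SSE_EXP_INT:
        # cvtsi2sd: the 32-bit argument is a signed integer
        return float(struct.unpack("<i", struct.pack("<I", number & 0xFFFFFFFF))[0])
    if flag & SSE_EXP_FLOAT:
        # cvtss2sd: the 32-bit argument holds the bit pattern of a Real4
        return struct.unpack("<f", struct.pack("<I", number & 0xFFFFFFFF))[0]
    if flag & (SSE_EXP_REAL8 | SSE_EXP_QWORD):
        # movsd / movq in the listing: the 64 bits are loaded as they are
        return struct.unpack("<d", number)[0]
    return None  # unknown flag


print(load_input(5, SSE_EXP_INT))
print(load_input(0x40A00000, SSE_EXP_FLOAT))              # bit pattern of 5.0 as Real4
print(load_input(struct.pack("<d", 5.0), SSE_EXP_REAL8))
```

Note that in this model, as in the listing, the QWORD case loads the 64 bits unchanged rather than converting an integer to a double.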

RosAsm version is as follows:

[FloatZero: R$ 0]
[SSE_Two52: Q$ 4841369599423283200, 4841369599423283200] ; = 2^52 ;D$ 0, 043300000, 0, 043300000]
[SSE_One: R$ 1, R$ 1]
[SSE_LOG_EMASK < 2.22507385850720082e-308 >]; Using Body Equate to set the proper value which is Closer to DBL_MIN that is: 2.2250738585072014e-308
[SSE_LOG_HIMASK < 1.79769313486231571e+308 >]; Using Body Equate to set the proper value which is Closer to DBL_MAX that is: 1.7976931348623158e+308
[SSE_Emask: R$ SSE_LOG_EMASK, SSE_LOG_EMASK] ; D$ 0-1, 0FFFFF,  0-1, 0FFFFF
[<16 SSE_LOG10_CC_0: R$ (111/256), R$ (111/256)] ; L10EA
[<16 SSE_Magic0: R$ 4.39804651110300781e+12, R$ 4.39804651110300781e+12]
[SSE_LOG10_HIMASK < 07FFFFFFF__80000000 >]; Using Body Equate to set the proper value which is Closer to DBL_MAX that is: 1.7976931348623158e+308
[SSE_HiMask0: Q$ SSE_LOG10_HIMASK, SSE_LOG10_HIMASK]
[<16 SSE_Place_Log2: D$ 0, 0
                     D$ 0FFFFFFFF, 0FFFFFFFF,
                     D$ 0FFFFFFFF, 0FFFFFFFF,
                     D$ 0, 0]
[<16 SSE_Log1020: R$ 3.01029995663952832e-1, R$ 2.83633945510449641e-14] ; Log10(2) = 0.3010299956639811952137388947244930267681898814621085413104274611
[<16 Log_Coeff0: R$ 21.5354732628465832, R$ -3.07179525615370474]
[<16 SSE_Log10Var1: R$ -10.8935578527763628, R$ 1.77588163534834509]
[<16 SSE_Log10Var2: R$ 5.66760060334353621, R$ -1.15501676674018694]
[<16 SSE_Log10Var3: R$ 1.61610240749971053e-3, R$ 0]

; most likely a Log10(x) table from 0 to 2 followed by some error
[<16 Log10_Table_T: R$ 0,                        R$ 0,                        R$ 6.83942453019881210e-003, R$ 1.06653844666057790e-013
                    R$ 1.33507081443440260e-002, R$ 8.67505934719954320e-014, R$ 1.99611018521181900e-002, R$ 9.23215523723840900e-014
                    R$ 2.62229227369061850e-002, R$ 7.49930509874774870e-014, R$ 3.25763513509400580e-002, R$ 2.41276323350416150e-014
                    R$ 3.90241079016959700e-002, R$ 1.07524865705891230e-014, R$ 4.50982556138797010e-002, R$ 2.01960064510954180e-014
                    R$ 5.12585643186866950e-002, R$ 3.16576133915421310e-014, R$ 5.70236199724831750e-002, R$ 2.44070710374291600e-014
                    R$ 6.31113911136935710e-002, R$ 2.48257788963393190e-014, R$ 6.87885240052992230e-002, R$ 1.09693456491489220e-013
                    R$ 7.45408528944153660e-002, R$ 8.48557345140544200e-014, R$ 8.03703965551676450e-002, R$ 5.64317690419344980e-014
                    R$ 8.60206705779091860e-002, R$ 2.11074530681352090e-014, R$ 9.14835662794075690e-002, R$ 2.48825492648481080e-014
                    R$ 9.70160548793046470e-002, R$ 8.88302993483350900e-014, R$ 1.02351435027458140e-001, R$ 8.15103304243699220e-014
                    R$ 1.07753177325776050e-001, R$ 4.45086172353454940e-014, R$ 1.12947822295495830e-001, R$ 3.08831178628687270e-015
                    R$ 1.18205353949292660e-001, R$ 3.88929326942825740e-014, R$ 1.23245578588807800e-001, R$ 4.71750903210203580e-014
                    R$ 1.28344985300145710e-001, R$ 6.57467750563177680e-014, R$ 1.33216699989134210e-001, R$ 2.71250821687622270e-014
                    R$ 1.38435254551609430e-001, R$ 7.56877432679591610e-015, R$ 1.43127205461155430e-001, R$ 6.81055570899612790e-015
                    R$ 1.48168577326714510e-001, R$ 6.02542674600232690e-014, R$ 1.52967460208515150e-001, R$ 2.83427723708756480e-014
                    R$ 1.57515087959154700e-001, R$ 1.09440764415528030e-013, R$ 1.62418959194383210e-001, R$ 5.35145729088875760e-014
                    R$ 1.67067178541742580e-001, R$ 5.99509971730059460e-014, R$ 1.71450865902443180e-001, R$ 1.13452388582535560e-013
                    R$ 1.76197300927015020e-001, R$ 3.28390775041250930e-015, R$ 1.80674603281659070e-001, R$ 1.03481913998834240e-013
                    R$ 1.85198545041771470e-001, R$ 3.73116362151814480e-014, R$ 1.89441967200082220e-001, R$ 2.98008559425539880e-014
                    R$ 1.93727260613627550e-001, R$ 8.13197508503595880e-014, R$ 1.98055259839406970e-001, R$ 3.57492486261199750e-014
                    R$ 2.02426824636404490e-001, R$ 7.53155364447138070e-014, R$ 2.06501548650066980e-001, R$ 7.07722033818675070e-014
                    R$ 2.10959407186123830e-001, R$ 1.06420585148621610e-013, R$ 2.15115366957320480e-001, R$ 6.74895867627637050e-014
                    R$ 2.18960252674605730e-001, R$ 6.67672190370376170e-014, R$ 2.23193863603228240e-001, R$ 1.36388742479585940e-014
                    R$ 2.27111265564531100e-001, R$ 2.32760258694916680e-014, R$ 2.31425484637043160e-001, R$ 2.92761620518569070e-014
                    R$ 2.35053696899512940e-001, R$ 6.25806885708624120e-014, R$ 2.39080054690248290e-001, R$ 3.00590055364650880e-014
                    R$ 2.43144090557620980e-001, R$ 1.05192270531529520e-014, R$ 2.46871963076841890e-001, R$ 3.27758102924169520e-014
                    R$ 2.50632111950153560e-001, R$ 2.79059969846129800e-014, R$ 2.54425100967296200e-001, R$ 2.43505807623931280e-014
                    R$ 2.58251508820308120e-001, R$ 6.53068360680055810e-014, R$ 2.62111929633533690e-001, R$ 7.78445532603097920e-014
                    R$ 2.65615893362905810e-001, R$ 1.97240087303329050e-014, R$ 2.69542633331980140e-001, R$ 6.12305485363238830e-014
                    R$ 2.73107313935042840e-001, R$ 3.18805618188936470e-014, R$ 2.76701495678366880e-001, R$ 1.05904780751833190e-013
                    R$ 2.80325670940214880e-001, R$ 4.14680805130042090e-014, R$ 2.83572747613220600e-001, R$ 1.90891161216058830e-014
                    R$ 2.87254964996350280e-001, R$ 1.65981890215736650e-014, R$ 2.90554464110186930e-001, R$ 4.83600159236775040e-014
                    R$ 2.94296613004917160e-001, R$ 9.56300328864571600e-014, R$ 2.97650255012513300e-001, R$ 8.72995849029738040e-014
                    R$ 3.01029995663952830e-001, R$ 2.83633945510449640e-014

Log10_Table_B: R$ (227328/524288), R$ (227328/524288), R$ (223776/524288), R$ (223776/524288)
               R$ (220446/524288), R$ (220446/524288), R$ (217116/524288), R$ (217116/524288)
               R$ (214008/524288), R$ (214008/524288), R$ (210900/524288), R$ (210900/524288)
               R$ (207792/524288), R$ (207792/524288), R$ (204906/524288), R$ (204906/524288)
               R$ (202020/524288), R$ (202020/524288), R$ (199356/524288), R$ (199356/524288)
               R$ (196581/524288), R$ (196581/524288), R$ (194028/524288), R$ (194028/524288)
               R$ (191475/524288), R$ (191475/524288), R$ (188922/524288), R$ (188922/524288)
               R$ (186480/524288), R$ (186480/524288), R$ (184149/524288), R$ (184149/524288)
               R$ (181818/524288), R$ (181818/524288), R$ (179598/524288), R$ (179598/524288)
               R$ (177378/524288), R$ (177378/524288), R$ (175269/524288), R$ (175269/524288)
               R$ (173160/524288), R$ (173160/524288), R$ (171162/524288), R$ (171162/524288)
               R$ (169164/524288), R$ (169164/524288), R$ (167277/524288), R$ (167277/524288)
               R$ (165279/524288), R$ (165279/524288), R$ (163503/524288), R$ (163503/524288)
               R$ (161616/524288), R$ (161616/524288), R$ (159840/524288), R$ (159840/524288)
               R$ (158175/524288), R$ (158175/524288), R$ (156399/524288), R$ (156399/524288)
               R$ (154734/524288), R$ (154734/524288), R$ (153180/524288), R$ (153180/524288)
               R$ (151515/524288), R$ (151515/524288), R$ (149961/524288), R$ (149961/524288)
               R$ (148407/524288), R$ (148407/524288), R$ (146964/524288), R$ (146964/524288)
               R$ (145521/524288), R$ (145521/524288), R$ (144078/524288), R$ (144078/524288)
               R$ (142635/524288), R$ (142635/524288), R$ (141303/524288), R$ (141303/524288)
               R$ (139860/524288), R$ (139860/524288), R$ (138528/524288), R$ (138528/524288)
               R$ (137307/524288), R$ (137307/524288), R$ (135975/524288), R$ (135975/524288)
               R$ (134754/524288), R$ (134754/524288), R$ (133422/524288), R$ (133422/524288)
               R$ (132312/524288), R$ (132312/524288), R$ (131091/524288), R$ (131091/524288)
               R$ (129870/524288), R$ (129870/524288), R$ (128760/524288), R$ (128760/524288)
               R$ (127650/524288), R$ (127650/524288), R$ (126540/524288), R$ (126540/524288)
               R$ (125430/524288), R$ (125430/524288), R$ (124320/524288), R$ (124320/524288)
               R$ (123321/524288), R$ (123321/524288), R$ (122211/524288), R$ (122211/524288)
               R$ (121212/524288), R$ (121212/524288), R$ (120213/524288), R$ (120213/524288)
               R$ (119214/524288), R$ (119214/524288), R$ (118326/524288), R$ (118326/524288)
               R$ (117327/524288), R$ (117327/524288), R$ (116439/524288), R$ (116439/524288)
               R$ (115440/524288), R$ (115440/524288), R$ (114552/524288), R$ (114552/524288)
               R$ (113664/524288), R$ (113664/524288)]

; Parameters flag
[SSE_EXP_INT 1
 SSE_EXP_FLOAT 2
 SSE_EXP_REAL8 4
 SSE_EXP_QWORD 8]

; Values to return
[SSE_EXP_INVALID_PARAMETER 0-1] ; Invalid flag
[SSE_UNDERFLOW 0-2] ; The input number underflows
[SSE_OVERFLOWN 0-3] ; The input number overflows
[SSE_INFINITE 0-4] ; General error. The input number is infinite, NaN, etc.
[SSE_ZERO 0-5] ; The input number is zero. Log and Ln are not defined for it
[SSE_NEG_INFINITE 0-6] ; Negative infinite found
[SSE_POS_INFINITE 0-7] ; Positive infinite found
[SSE_NAN 0-9] ; NAN. Not a number

Proc Sse2_log10_precise:
    Arguments @Number, @Flag
    Uses edx, ecx

    mov eax D@Flag
    Test_if eax SSE_EXP_INT
        cvtsi2sd xmm0 D@Number ; converts a signed integer to double
    Test_Else_if eax SSE_EXP_FLOAT
        cvtss2sd xmm0 X@Number ; converts a single precision float to double
    Test_Else_if eax SSE_EXP_REAL8
        mov eax D@Number | movsd XMM0 X$eax
    Test_Else_if eax SSE_EXP_QWORD
        mov eax D@Number | movq XMM0 X$eax
    Test_Else
        xor eax eax | ExitP ; return 0 Invalid parameter
    Test_End

    xor edx edx
    movupd XMM1 XMM0
    unpcklpd XMM0 XMM0
    psrlq XMM1 52 | pextrw ecx XMM1 0 | and ecx 0FFF | sub ecx 1

    ...If ecx > 2045 ; Special cases: zero, negative, denormalized, infinite or NaN

        .SSE_D_If xmm0 <= X$Float_Zero ; Input value is zero or negative
            mov eax SSE_ZERO
            SSE_D_If xmm0 < X$Float_Zero ; Input value is negative
                mov eax SSE_NEG_INFINITE
            SSE_D_End
            ExitP ; Exit the function
        .SSE_D_Else
            ..If ecx = 0-1  ; number is denormalized. We can continue
                mulsd XMM0 X$SSE_Two52 ; for very tiny numbers. Ex: x = 2e-314;XMM1
                mov edx 0-52
                xor eax eax
                movupd XMM1 XMM0
                unpcklpd XMM0 XMM0
                psrlq XMM1 52 | pextrw ecx XMM1 0 | and ecx 0FFF | sub ecx 1
            ..Else
                movupd XMM1 X$SSE_One ; same as SSE_One2
                andpd XMM0 X$SSE_Emask ; same as SSE_Emask2
                orpd XMM0 XMM1
                cmpsd XMM1 XMM0 0 ; (EQ) error in rosasm
                pextrw eax XMM1 0
                If eax = 0 ; Not a Number
                    mov eax SSE_NAN
                Else ; Number is positive infinite
                    mov eax SSE_POS_INFINITE
                End_If
                ExitP ; Exit the function
            ..End_If
        .SSE_D_End

    ...End_If

    movupd XMM1 X$SSE_Magic0 | andpd XMM0 X$SSE_Emask | orpd XMM0 X$SSE_One | addpd XMM1 XMM0
    pextrw eax XMM1 0 | and eax ((Size_of_LogTable*2)-48)

    movupd XMM4 X$Log10_Table_T+eax
    movupd XMM1 X$SSE_HiMask0 | andpd XMM1 XMM0
    subpd XMM0 XMM1 | mulpd XMM0 X$Log10_Table_B+eax
    mulpd XMM1 X$Log10_Table_B+eax | subpd XMM1 X$SSE_LOG10_CC_0
    addsd XMM4 XMM1

    sub ecx 1022 | add ecx edx
    cvtsi2sd XMM2 ecx | shl ecx 10 | add eax ecx | mov ecx 16 | mov edx 0 | cmp eax 0 | cmovz edx ecx

    movupd XMM3 XMM0 | andpd XMM3 X$SSE_Place_Log2+edx
    addpd XMM0 XMM1
     ; same as SSE_Place_Log0
    unpcklpd XMM2 XMM2 | mulpd XMM2 X$SSE_Log1020 | addpd XMM2 XMM3 | addpd XMM4 XMM2

    movupd XMM1 X$Log_Coeff0
    movupd XMM2 XMM0 | mulpd XMM2 XMM2 | mulsd XMM2 XMM2 | mulsd XMM2 XMM0
    mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1

    mulpd XMM1 XMM0
    addpd XMM1 X$SSE_Log10Var2 | mulpd XMM1 XMM2

    movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0

    movupd XMM3 XMM4
    unpckhpd XMM3 XMM3

    movupd XMM0 XMM1
    addpd XMM1 XMM2
    unpckhpd XMM0 XMM0

    addsd XMM0 XMM1
    addsd XMM0 XMM3
    addsd XMM0 XMM4

EndP
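For readers who want the general shape of such a routine without the SSE details, here is a much simplified sketch in Python. It is not a translation of the code above: the table size, the interval midpoints and the Taylor polynomial are assumptions chosen for clarity, and the table is built with `math.log10`, where a real implementation would use precomputed constants (as `Log10_Table_T` and `Log10_Table_B` appear to do). The idea is log10(x) = e*log10(2) + log10(c) + log10(1 + r), with x = m*2^e, c taken from a table, and r small.

```python
import math

N = 64  # number of mantissa intervals (assumed, not the posted table size)

# Interval midpoints c_i in [1, 2) and their precomputed logarithms.
CENTERS = [1.0 + (i + 0.5) / N for i in range(N)]
LOG10_C = [math.log10(c) for c in CENTERS]
LOG10_2 = math.log10(2.0)
LOG10_E = 1.0 / math.log(10.0)


def log10_table(x):
    """Sketch of a table-driven log10 for positive, finite x."""
    m, e = math.frexp(x)          # x = m * 2**e with m in [0.5, 1)
    m, e = 2.0 * m, e - 1         # rescale so that m is in [1, 2)
    i = int((m - 1.0) * N)        # table index from the top mantissa bits
    r = m / CENTERS[i] - 1.0      # small remainder, |r| <= 1/(2N)
    # Short Taylor polynomial for ln(1 + r), evaluated with Horner's rule.
    p = r * (1.0 + r * (-1.0 / 2 + r * (1.0 / 3 + r * (-1.0 / 4 + r * (1.0 / 5)))))
    return e * LOG10_2 + LOG10_C[i] + p * LOG10_E


print(log10_table(5.0), math.log10(5.0))
```

This sketch keeps everything in a single double, so it does not reach the accuracy of a routine that carries high and low parts of the table entries separately.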

My timings

AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

2747    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11641   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27174   cycles for 100 * pow (CRT, 2.7182818^5)

2619    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11399   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27011   cycles for 100 * pow (CRT, 2.7182818^5)

2632    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11207   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25996   cycles for 100 * pow (CRT, 2.7182818^5)

2593    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11419   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26219   cycles for 100 * pow (CRT, 2.7182818^5)

2553    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
11310   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
26940   cycles for 100 * pow (CRT, 2.7182818^5)

148.413159102577 for Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)

Updated version with the proper values (the ones between parentheses) calculated. File: TimmingsLog10g.zip (faster and still accurate)

Siekmanski · August 16, 2020, 06:49:08 AM

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

3398    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12420   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24740   cycles for 100 * pow (CRT, 2.7182818^5)

3412    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12403   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24733   cycles for 100 * pow (CRT, 2.7182818^5)

3402    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12412   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24766   cycles for 100 * pow (CRT, 2.7182818^5)

3401    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12403   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24788   cycles for 100 * pow (CRT, 2.7182818^5)

3411    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
12437   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
24734   cycles for 100 * pow (CRT, 2.7182818^5)

148.413159102577 for Sse2_log10_precise (Guga SSE2 Log10 precise , 2.7182818^5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)

guga · August 16, 2020, 07:24:05 AM

Hi Marinus

You were right. I'm liking this SSE stuff.

I must confess, it's a bit complicated at first, but the results are worth the effort.

I'm giving those faster functions a try in order to build a DLL for use with the image processing functions. This could be very useful.

HSE · August 16, 2020, 07:47:28 AM

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

5396    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11737   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5388    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11746   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5387    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11779   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5447    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11731   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5388    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11745   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for SmplMath:  fSlv MyReal8 = log(MyExpo)

Siekmanski · August 16, 2020, 07:48:26 AM

jj2007 · August 16, 2020, 08:00:03 AM

Does it return the result in xmm0?

call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0

HSE · August 16, 2020, 08:07:03 AM

Apparently this is faster:

    movdqu oword ptr MyReal8, xmm0

LATER:

Quote from: jj2007 on August 16, 2020, 08:00:03 AM
movlps MyReal8, xmm0

perhaps it should be:

movlpd MyReal8, xmm0

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

5210    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11760   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5217    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11750   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5213    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11758   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5228    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11796   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

5215    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11820   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for SmplMath:  fSlv MyReal8 = log(MyExpo)

guga · August 16, 2020, 08:30:25 AM

Quote from: jj2007 on August 16, 2020, 08:00:03 AM
Does it return the result in xmm0?

call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0

Hi JJ

Yes, the result is in xmm0

If i ported it correctly, it should return this:

jj2007 · August 16, 2020, 08:38:17 AM

Quote from: guga on August 16, 2020, 08:30:25 AM
Quote from: jj2007 on August 16, 2020, 08:00:03 AM
Does it return the result in xmm0?

call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0
Hi JJ

Yes, the result is in xmm0

If i ported it correctly, it should return this:

Just add the movlps MyReal8, xmm0, and you will see the result at the end.

HSE · August 16, 2020, 08:58:08 AM

movlpd

jj2007 · August 16, 2020, 09:46:48 AM

Quote from: HSE on August 16, 2020, 08:58:08 AM
movlpd

Most if not all sse* mov instructions don't care what format they are moving. I use movlps because it's one byte shorter.

guga · August 16, 2020, 09:52:04 AM

Quote from: jj2007 on August 16, 2020, 08:38:17 AM
Quote from: guga on August 16, 2020, 08:30:25 AM
Quote from: jj2007 on August 16, 2020, 08:00:03 AM
Does it return the result in xmm0?

call Sse2_log10_precise ; invoke Sse2_log10_precise, 5, 1
movlps MyReal8, xmm0
Hi JJ

Yes, the result is in xmm0

If i ported it correctly, it should return this:

Just add the movlps MyReal8, xmm0, and you will see the result at the end.

Hi JJ. The value is shown incorrectly. I guess I didn't port it properly.

Most likely the problem is in the data variables.

Log10_Table_T and Log10_Table_B are part of the same table. So basically it is one array split in two, but when I ported it to MASM, the values of Log10_Table_B were lost.

When I ported it as:

Log10_Table_B dq (227328/524288), (227328/524288), (223776/524288), (223776/524288)
dq (220446/524288), (220446/524288), (217116/524288), (217116/524288)
dq (214008/524288), (214008/524288), (210900/524288), (210900/524288)
dq (207792/524288), (207792/524288), (204906/524288), (204906/524288)
dq (202020/524288), (202020/524288), (199356/524288), (199356/524288)
(...)

All the values between parentheses were zeroed. MASM compiled it as:
Log10_Table_B xmmword 41h dup(0)

Why?

How can I make MASM calculate the values of the qwords, such as 220446/524288? Do I need to remove the parentheses?
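A likely explanation is that the assembler evaluates these constant expressions with integer arithmetic, so every quotient smaller than 1 truncates to 0. One way around it, sketched below in Python, is to compute the quotients beforehand and paste them in as decimal literals. The script and the helper name `real8_line` are only an illustration; the numerators are the first entries of Log10_Table_B from the first post.

```python
# Numerators taken from the first rows of Log10_Table_B in the post.
NUMERATORS = [227328, 223776, 220446, 217116]
DENOMINATOR = 524288  # 2**19

# Integer division, as an integer-only expression evaluator would do it.
print([n // DENOMINATOR for n in NUMERATORS])   # every entry is 0


def real8_line(n):
    """Format one table row (the value appears twice, as in the original table)."""
    value = n / DENOMINATOR      # true division, exact because 524288 = 2**19
    return "REAL8 {0!r}, {0!r}".format(value)


for n in NUMERATORS:
    print(real8_line(n))
```

Because the denominator is a power of two, each quotient is exactly representable as a double, so the printed literals lose nothing compared with the fractions.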

HSE · August 16, 2020, 10:19:50 AM

Quote from: jj2007 on August 16, 2020, 09:46:48 AM
Most if not all sse* mov instructions don't care what format they are moving. I use movlps because it's one byte shorter.

Well JJ, you are lucky: your machine is smarter than mine!

movlps:

9744 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )

movlpd:

5192 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )

movdqu oword ptr MyReal8, xmm0:

5384 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )

jj2007 · August 16, 2020, 10:27:33 AM

That looks odd, can you post the executables? The call to the routine is a hundred times slower than the movlps/movlpd.

Let's test it:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

25285   cycles for 10000 * movlps
25166   cycles for 10000 * movlpd
25808   cycles for 10000 * movdqu

25155   cycles for 10000 * movlps
25053   cycles for 10000 * movlpd
25257   cycles for 10000 * movdqu

25271   cycles for 10000 * movlps
24811   cycles for 10000 * movlpd
25296   cycles for 10000 * movdqu

25034   cycles for 10000 * movlps
25595   cycles for 10000 * movlpd
25706   cycles for 10000 * movdqu

24902   cycles for 10000 * movlps
25345   cycles for 10000 * movlpd
25546   cycles for 10000 * movdqu

28      bytes for movlps
32      bytes for movlpd
32      bytes for movdqu

R8      1234567890.123456716
R8      1234567890.123456716
R8      1234567890.123456716

HSE · August 16, 2020, 10:31:42 AM

It's the routine with the mov.

The code is above, just change line 345.
