The MASM Forum

General => The Laboratory => Topic started by: guga on August 12, 2020, 12:29:23 AM

Title: Fast Exp approximation
Post by: guga on August 12, 2020, 12:29:23 AM
A fast and precise exponent function using SSE2  :azn: :azn: :azn:

In my AMD it took only 31 to 32 clocks in average. I´m quite sure it still have some more room to optimize :mrgreen: :mrgreen:


[SSE_Cv4: R$ 92.3324826168936567, R$ 92.3324826168936567]
; Int64ToDoubleMagic
[SSE_Shifter4: R$ 6.755399441055744e+15,    R$ 6.755399441055744e+15,
               R$ 8.67361737988403547e-19,  R$ 8.67361737988403547e-19]
[SSE_ExpVar1: R$ 1.08304246962234174e-2, R$ 1.08304246962234174e-2]
[SSE_ExpVar2: R$ 2.5728046223276691e-14, R$ 2.5728046223276691e-14]
[SSE_ExpVar3: R$ 8.33216827061673306e-3, R$ 1.66666666698150945e-1]
[SSE_ExpVar4: R$ 4.16667208687284685e-2, R$ 4.99999999999566291e-1]

[ExpTable: R$ 0,                       0,                        2.04582073601169570e-16, 2.42294657314453310e-310
           R$ 4.99974487227263260e-17, 4.87227730461210550e-310, 7.35784687124741820e-18, 7.34827949904982370e-310
           R$ 8.18931763819551360e-17, 9.85124358964738520e-310, 1.66658814423267470e-18, 1.23814631722046340e-309
           R$ 1.34046152845389320e-16, 1.49392350395699130e-309, 1.44126483728110840e-16, 1.75248592164539530e-309
           R$ 1.75676685628521850e-16, 2.01386389946222700e-309, 4.77695942525622330e-17, 2.27808809684712520e-309
           R$ 9.34171060990504440e-17, 2.54518950709912420e-309, 4.58567032666235110e-17, 2.81519946101217220e-309
           R$ 7.82657325863607620e-17, 3.08814963055018380e-309, 2.82378442595106130e-17, 3.36407203256215700e-309
           R$ 3.29047266460084160e-17, 3.64299903253770950e-309, 4.72136812117012750e-17, 3.92496334840353820e-309
           R$ 3.34846233362515240e-17, 4.20999805436120270e-309, 5.52755004850524930e-17, 4.49813658476670150e-309
           R$ 1.19250041407656740e-16, 4.78941273805230130e-309, 1.65291010027318310e-16, 5.08386068069106740e-309
           R$ 3.75085420130312720e-17, 5.38151495120456640e-309, 1.71528210782905300e-16, 5.68241046421419000e-309
           R$ 2.10230496752157180e-18, 5.98658251453665700e-309, 1.33575100888345410e-17, 6.29406678132401250e-309
           R$ 1.95725852931120360e-17, 6.60489933224881300e-309, 1.14594728088946900e-16, 6.91911662773481080e-309
           R$ 1.45979432294737540e-16, 7.23675552523374920e-309, 6.66380458923219520e-17, 7.55785328354870360e-309
           R$ 5.68648095791174000e-17, 7.88244756720450720e-309, 7.00787504690699450e-17, 8.21057645086578490e-309
           R$ 1.11524233151705370e-16, 8.54227842380307460e-309, 8.99959272503798580e-17, 8.87759239440759380e-309
           R$ 8.86511592917582850e-17, 9.21655769475515650e-309, 1.46901932708742340e-16, 9.55921408521980240e-309
           R$ 1.32721817677819360e-16, 9.90560175913766330e-309, 1.13655151388858840e-16, 1.02557613475215990e-308
           R$ 1.26761474172717220e-16, 1.06097339238271920e-308, 9.50689710108795960e-18, 1.09675610087706390e-308
           R$ 7.97786311075428990e-17, 1.13292845751990740e-308, 7.32795774215157790e-17, 1.16949470530139930e-308
           R$ 5.15483011707867830e-17, 1.20645913341482120e-308, 2.42539857666898060e-17, 1.24382607775970770e-308
           R$ 1.34460827434261430e-16, 1.28159992145044320e-308, 7.60136434780041980e-17, 1.31978509533039990e-308
           R$ 1.53414100536037230e-17, 1.35838607849166940e-308, 9.51550643423588250e-17, 1.39740739880045840e-308
           R$ 7.33935310510226070e-17, 1.43685363342820510e-308, 3.54094826264618290e-17, 1.47672940938847590e-308
           R$ 4.87516052622706170e-17, 1.51703940407971140e-308, 1.25886762958400070e-16, 1.55778834583388210e-308
           R$ 1.18426926175418470e-16, 1.59898101447111830e-308, 1.82140544036225900e-17, 1.64062224186037870e-308
           R$ 1.68548729062897310e-17, 1.68271691248622430e-308, 3.62161593533689420e-17, 1.72526996402176670e-308
           R$ 1.01562190116415010e-17, 1.76828638790785260e-308, 6.74378629321924300e-17, 1.81177122993855490e-308
           R$ 1.79012690760451310e-17, 1.85572959085304300e-308, 5.26537076855627410e-17, 1.90016662693389250e-308
           R$ 8.58071433248992810e-17, 1.94508755061191610e-308, 1.79639326598330220e-17, 1.99049763107757670e-308
           R$ 6.04870235399373270e-17, 2.03640219489905820e-308, 5.33680587851415070e-17, 2.08280662664707050e-308
           R$ 4.57849152770600950e-17, 2.12971636952645360e-308, 2.04142788975783030e-17, 2.17713692601466200e-308]

[SSE_Exp_MMask0: Q$ SSE_EXPONENT_MMASK, SSE_EXPONENT_MMASK]
[SSE_Exp_Bias0: Q$ SSE_EXPONENT_BIAS, SSE_EXPONENT_BIAS]
[SSE_OneVal0: R$ 1.0]
[SSE_Exp_EMask0: Q$ SSE_EXPONENT_EMASK, SSE_EXPONENT_EMASK]

[SSe_Exp_RetVal: Q$ 0]
[SSE_TmpVar2: Q$ 0]

; Equates used

[SSE_EXPONENT_BIAS 0FFC0]       ;  2^16-2^6 = 2^16-64 = 65472
[SSE_EXPONENT_MMASK 0FFFFFFC0]   ;  2^32-2^6 = 4294967232 (in int)
[SSE_EXPONENT_EMASK 0FFF00000]   ;  4293918720 (in int)

; Parameters flag
[SSE_EXP_INT 1
SSE_EXP_FLOAT 2
SSE_EXP_REAL8 4
SSE_EXP_QWORD 8]

; Values to return
[SSE_EXP_INVALID_PARAMETER 0-1] ; Invalid flag
[SSE_UNDERFLOW 0-1] ; The inputed number is underflow
[SSE_OVERFLOWN 0-2] ; The inputed number is overflow
[SSE_INFINITE 0-3] ; General error. The inputed number is infinite, or NAN etc

;;
    SSE2_Exp_precise
   
    This function calculates the exponential of a number using approximations on a higher level of accuracy.

    Parameters:
        Number (in) - The inputed value to be found. Your an insert any type of number to be calculated. Negative, positive,
                      Float, Integer, Double, Qword etc.
                      This parameter can be used as a Int or a Pointer to the value depending of the used flag.
       
        Flag. A set of flags to be used accordying to the type of input.
                Equate Name     Value       Description
                SSE_EXP_INT       1             The user inputed a Int value to be calculated.
                SSE_EXP_FLOAT     2             The inputed value is a Float (Real4) data.
                SSE_EXP_REAL8     3             The inputed value is a Pointer to a Real8 data.
                SSE_EXP_QWORD     4             The inputed value is a Pointer to a QWord data.

    Return Value:
        If the function suceeds, it returns TRUE and the resultant value is stored as a Double (Real8) in the low quadword of  xmm0 register
       
        If the function fails it returns a set of error cases described in the equates below:
        [SSE_EXP_INVALID_PARAMETER 0] ; Invalid flag specified.
        [SSE_UNDERFLOW 0-1] ; The inputed number is underflow
        [SSE_OVERFLOWN 0-2] ; The inputed number is overflow
        [SSE_INFINITE 0-3] ; General error. The inputed number is infinite, or NAN etc


    Example of usage:

    1 - The inputed value (-5) is a Signed Integer
            call Sse2_exp_precise 0-5, SSE_EXP_INT

    2 - The inputed value (-5) is a Float (Real4)
        [MyNumber: F$ -5.0]
            call Sse2_exp_precise D$MyNumber, SSE_EXP_FLOAT

    3 - The inputed value (5) is a Real8
        [MyNumber: R$ 5.0]
            call Sse2_exp_precise MyNumber, SSE_EXP_REAL8

    4 - The inputed value (5) is a Qword (int64)
        [MyNumber: Q$ 5.0]
            call Sse2_exp_precise MyNumber, SSE_EXP_QWORD
;;

Proc Sse2_exp_precise:
    Arguments @Number, @Flag
    Local @ControlWord
    Uses edx, ecx

    mov D@ControlWord 0

    mov eax D@Flag
    Test_if eax SSE_EXP_INT
        cvtsi2sd xmm0 D@Number ; converts a signed integer to double
    Test_Else_if eax SSE_EXP_FLOAT
        cvtss2sd xmm0 X@Number ; converts a single precision float to double
    Test_Else_if eax SSE_EXP_REAL8
        mov eax D@Number | movsd XMM0 X$eax
    Test_Else_if eax SSE_EXP_QWORD
        mov eax D@Number | movq XMM0 X$eax
    Test_Else
        xor eax eax | ExitP ; return 0 Invalid parameter
    Test_End

    unpcklpd XMM0 XMM0 | pextrw eax XMM0 3 | and eax 07FFF
    mov edx 0408F | sub edx eax
    sub eax 03C90 | or edx eax
    ...If edx < 080000000 ; Number is not zero
        movupd XMM1 X$SSE_Cv4
        movupd XMM6 X$SSE_Shifter4
        movupd XMM2 X$SSE_ExpVar1
        movupd XMM3 X$SSE_ExpVar2

        mulpd XMM1 XMM0 | addpd XMM1 XMM6 | movupd XMM7 XMM1
        subpd XMM1 XMM6 | mulpd XMM2 XMM1
        mulpd XMM3 XMM1
        movupd XMM5 X$SSE_ExpVar4
        subpd XMM0 XMM2
        movd eax XMM7 | mov ecx eax | and ecx 03F | shl ecx 04 | sar eax 06 | mov edx eax
        subpd XMM0 XMM3
        movupd XMM2 X$ecx+ExpTable
        movupd XMM4 X$SSE_ExpVar3
        mulpd XMM4 XMM0
        movupd XMM1 XMM0
        mulpd XMM0 XMM0
        addpd XMM5 XMM4
        mulsd XMM0 XMM0
        addsd XMM1 XMM2
        unpckhpd XMM2 XMM2
        movdqu XMM6 X$SSE_Exp_MMask0 | pand XMM7 XMM6
        movdqu XMM6 X$SSE_Exp_Bias0 | paddq XMM7 XMM6 | psllq XMM7 46
        mulpd XMM0 XMM5
        addsd XMM1 XMM0
        orpd XMM2 XMM7
        unpckhpd XMM0 XMM0
        addsd XMM0 XMM1
        add edx 037E
        ..If edx <= 077C ; Adjust 0
            mulsd XMM0 XMM2
            addsd XMM0 XMM2
            mov eax &TRUE
        ..Else
            fstcw W@ControlWord
            mov dx W@ControlWord | or dx 0300 | mov W$SSe_Exp_RetVal dx | fldcw W$SSe_Exp_RetVal
            mov edx eax | sar eax 1 | sub edx eax
            movdqu XMM6 X$SSE_Exp_EMask0
            pandn XMM6 XMM2
            add eax 03FF | movd XMM3 ax | psllq XMM3 52
            orpd XMM6 XMM3
            add edx 03FF | movd XMM4 dx | psllq XMM4 52
            movlpd X$SSe_Exp_RetVal XMM0 | fld R$SSe_Exp_RetVal
            movlpd X$SSE_TmpVar2 XMM0 | fld R$SSE_TmpVar2
            fmul ST1 ST0
            faddp ST1 ST0
            movlpd X$SSe_Exp_RetVal XMM4 | fld R$SSe_Exp_RetVal
            fmulp ST1 ST0
            fstp R$SSe_Exp_RetVal
            movlps XMM0 X$SSe_Exp_RetVal
            fldcw W@ControlWord
            pextrw ecx XMM0 3 | and ecx 07FF0
            If ecx >= 07FF0 ; Overflown
                ;xor eax eax
                mov eax SSE_OVERFLOWN
            Else_If ecx = 0 ; Found underflow
                ;xor eax eax
                mov eax SSE_UNDERFLOW
            Else
                mov eax &TRUE
            End_If
        ..End_If
    ...Else
        ; Number seems to be zero. Let´s confirm.
        mov eax D@Number | and eax 07FFFFFFF
        If eax < 040900000 ; Input is zero. Return 1
            addsd XMM0 X$SSE_OneVal0
            mov eax &TRUE
        Else
            ; Error. The number can be a underflow or overflow, NAN, INF etc
            mov eax 0-1 ; Underflow_Overflow
        End_If
    ...End_If

EndP



If someone can please test to benchmark it would be appreciated :)  Attached the masm version ;)



Btw...below is another version (a inaccurate one  :rolleyes: :rolleyes: :rolleyes:) that takes only 14 clocks :mrgreen: :mrgreen: :mrgreen: :mrgreen:


[M_LN2_D: R$ 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184,
          R$ 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184] ; 0.5/(ln(2))
[SSE_FastExp_Var1: R$ 3070.95411, R$ 3070.95411]; 3 * 1024.0 - 1.05

(...)
The inputed value must be in xmm0 and..thats it. Only 10 instructions :bgrin: :bgrin: :bgrin:

    movups xmm2 X$M_LN2_D
    movups xmm1 X$SSE_FastExp_Var1 | mulpd xmm0 xmm2
    movups xmm2 xmm1 | addpd xmm2 xmm0
    psllq XMM2 11

    ; now we subtract
    subpd xmm1 xmm0
    psllq XMM1 11

    divpd xmm2 xmm1
    movups xmm0 xmm2

This code is adapted from
https://stackoverflow.com/questions/53872265/how-do-you-process-exp-with-sse2

basically, it does only this:

shl 11 ((Number*(0.5/(ln(2))+(3 * 1024.0 - 1.05)))
______________________________________________
shl 11 ((Number*(0.5/(ln(2))-(3 * 1024.0 - 1.05)))



Updated. Attached JJ benchmark test "GugatestExponent.zip"
Title: Re: Fast Exp approximation
Post by: hutch-- on August 12, 2020, 12:40:20 AM
Guga,

You need to build it so anyone can test it. I downloaded in but its a source of unknown assembler type.
Title: Re: Fast Exp approximation
Post by: guga on August 12, 2020, 12:52:48 AM
Unknown ?

That´s weird. I thought it was compatible to masm. I´ll ty to port it. :thumbsup:
Title: Re: Fast Exp approximation
Post by: hutch-- on August 12, 2020, 01:12:49 AM
Just build it so we can test it for you.
Title: Re: Fast Exp approximation
Post by: guga on August 12, 2020, 02:04:44 AM
Ok. i guess now it will work.

Title: Re: Fast Exp approximation
Post by: jj2007 on August 12, 2020, 02:10:21 AM
I just see a box saying 3D frames :sad:
Title: Re: Fast Exp approximation
Post by: guga on August 12, 2020, 02:22:49 AM
try now.

It is only the function code i made. Not a working app. The returned value is stored in the 1st double in xmm0


; #########################################################################
;
; The framing affects are drawn on the client area by a single procedure,
; "Frame3D". In the WmdProc message handling Proc, the WM_PAINT message
; calls another Proc called "Paint_Proc" which contains the procedure calls
; to "Frame3D".
;
; #########################################################################

.686p
.model flat,stdcall
.mmx
.xmm

OPTION CASEMAP:NONE



; #########################################################################

      include \masm32\include\windows.inc
      include \masm32\include\user32.inc
      include \masm32\include\kernel32.inc
      include \masm32\include\gdi32.inc
     
      includelib \masm32\lib\user32.lib
      includelib \masm32\lib\kernel32.lib
      includelib \masm32\lib\gdi32.lib

; #########################################################################

      ;=============
      ; Local macros
      ;=============

      szText MACRO Name, Text:VARARG
        LOCAL lbl
          jmp lbl
            Name db Text,0
          lbl:
        ENDM

      m2m MACRO M1, M2
        push M2
        pop  M1
      ENDM

      return MACRO arg
        mov eax, arg
        ret
      ENDM

        ;=================
        ; Local prototypes
        ;=================
;        Sse2_exp_precise PROTO :DWORD,:DWORD

    .data?
    XMMWORD2 struc
     data   dq 2 dup (?)
    XMMWORD2 ends
   


    .data

        szDisplayName db "3D Frames",0
        CommandLine   dd 0
        hWnd          dd 0
        hInstance     dd 0

SSE_Cv4 dq 92.3324826168936567, 92.3324826168936567
; Int64ToDoubleMagic
SSE_Shifter4 dq 6.755399441055744e+15,    6.755399441055744e+15,
                8.67361737988403547e-19,    8.67361737988403547e-19
SSE_ExpVar1 dq 1.08304246962234174e-2, 1.08304246962234174e-2
SSE_ExpVar2 dq 2.5728046223276691e-14, 2.5728046223276691e-14
SSE_ExpVar3 dq 8.33216827061673306e-3, 1.66666666698150945e-1
SSE_ExpVar4 dq 4.16667208687284685e-2, 4.99999999999566291e-1

ExpTable dq 0,                       0,                        2.04582073601169570e-16, 2.42294657314453310e-310
               dq 4.99974487227263260e-17, 4.87227730461210550e-310, 7.35784687124741820e-18, 7.34827949904982370e-310
               dq 8.18931763819551360e-17, 9.85124358964738520e-310, 1.66658814423267470e-18, 1.23814631722046340e-309
               dq 1.34046152845389320e-16, 1.49392350395699130e-309, 1.44126483728110840e-16, 1.75248592164539530e-309
               dq 1.75676685628521850e-16, 2.01386389946222700e-309, 4.77695942525622330e-17, 2.27808809684712520e-309
               dq 9.34171060990504440e-17, 2.54518950709912420e-309, 4.58567032666235110e-17, 2.81519946101217220e-309
               dq 7.82657325863607620e-17, 3.08814963055018380e-309, 2.82378442595106130e-17, 3.36407203256215700e-309
               dq 3.29047266460084160e-17, 3.64299903253770950e-309, 4.72136812117012750e-17, 3.92496334840353820e-309
               dq 3.34846233362515240e-17, 4.20999805436120270e-309, 5.52755004850524930e-17, 4.49813658476670150e-309
               dq 1.19250041407656740e-16, 4.78941273805230130e-309, 1.65291010027318310e-16, 5.08386068069106740e-309
               dq 3.75085420130312720e-17, 5.38151495120456640e-309, 1.71528210782905300e-16, 5.68241046421419000e-309
               dq 2.10230496752157180e-18, 5.98658251453665700e-309, 1.33575100888345410e-17, 6.29406678132401250e-309
               dq 1.95725852931120360e-17, 6.60489933224881300e-309, 1.14594728088946900e-16, 6.91911662773481080e-309
               dq 1.45979432294737540e-16, 7.23675552523374920e-309, 6.66380458923219520e-17, 7.55785328354870360e-309
               dq 5.68648095791174000e-17, 7.88244756720450720e-309, 7.00787504690699450e-17, 8.21057645086578490e-309
               dq 1.11524233151705370e-16, 8.54227842380307460e-309, 8.99959272503798580e-17, 8.87759239440759380e-309
               dq 8.86511592917582850e-17, 9.21655769475515650e-309, 1.46901932708742340e-16, 9.55921408521980240e-309
               dq 1.32721817677819360e-16, 9.90560175913766330e-309, 1.13655151388858840e-16, 1.02557613475215990e-308
               dq 1.26761474172717220e-16, 1.06097339238271920e-308, 9.50689710108795960e-18, 1.09675610087706390e-308
               dq 7.97786311075428990e-17, 1.13292845751990740e-308, 7.32795774215157790e-17, 1.16949470530139930e-308
               dq 5.15483011707867830e-17, 1.20645913341482120e-308, 2.42539857666898060e-17, 1.24382607775970770e-308
               dq 1.34460827434261430e-16, 1.28159992145044320e-308, 7.60136434780041980e-17, 1.31978509533039990e-308
               dq 1.53414100536037230e-17, 1.35838607849166940e-308, 9.51550643423588250e-17, 1.39740739880045840e-308
               dq 7.33935310510226070e-17, 1.43685363342820510e-308, 3.54094826264618290e-17, 1.47672940938847590e-308
               dq 4.87516052622706170e-17, 1.51703940407971140e-308, 1.25886762958400070e-16, 1.55778834583388210e-308
               dq 1.18426926175418470e-16, 1.59898101447111830e-308, 1.82140544036225900e-17, 1.64062224186037870e-308
               dq 1.68548729062897310e-17, 1.68271691248622430e-308, 3.62161593533689420e-17, 1.72526996402176670e-308
               dq 1.01562190116415010e-17, 1.76828638790785260e-308, 6.74378629321924300e-17, 1.81177122993855490e-308
               dq 1.79012690760451310e-17, 1.85572959085304300e-308, 5.26537076855627410e-17, 1.90016662693389250e-308
               dq 8.58071433248992810e-17, 1.94508755061191610e-308, 1.79639326598330220e-17, 1.99049763107757670e-308
               dq 6.04870235399373270e-17, 2.03640219489905820e-308, 5.33680587851415070e-17, 2.08280662664707050e-308
               dq 4.57849152770600950e-17, 2.12971636952645360e-308, 2.04142788975783030e-17, 2.17713692601466200e-308



SSE_Exp_MMask0 dd 0FFFFFFC0h, 0, 0FFFFFFC0h, 0
SSE_Exp_Bias0 dq 0FFC0h, 0FFC0h
SSE_OneVal0 dq 1.0
SSE_Exp_EMask0 dq 0FFF00000h, 0FFF00000h

SSe_Exp_RetVal dq 0
SSE_TmpVar2 dq 0

    .code

start:




    push 1
    push 5
    call Sse2_exp_precise
    invoke ExitProcess, 0

; #########################################################################

Sse2_exp_precise proc

ControlWord     = dword ptr -4
Number          = dword ptr  8
Flag            = dword ptr  0Ch

                push    ebp
                mov     ebp, esp
                sub     esp, 4
                push    edx
                push    ecx
                mov     [ebp+ControlWord], 0
                mov     eax, [ebp+Flag]
                test    eax, 1
                jz      short loc_4CE8D0
                cvtsi2sd xmm0, [ebp+Number]
                jmp     short loc_4CE905
; ---------------------------------------------------------------------------

loc_4CE8D0:                             ; CODE XREF: Sse2_exp_precise+17?j
                test    eax, 2
                jz      short loc_4CE8DE
                cvtss2sd xmm0, [ebp+Number]
                jmp     short loc_4CE905
; ---------------------------------------------------------------------------

loc_4CE8DE:                             ; CODE XREF: Sse2_exp_precise+25?j
                test    eax, 4
                jz      short loc_4CE8EE
                mov     eax, [ebp+Number]
                movsd   xmm0, qword ptr [eax]
                jmp     short loc_4CE905
; ---------------------------------------------------------------------------

loc_4CE8EE:                             ; CODE XREF: Sse2_exp_precise+33?j
                test    eax, 8
                jz      short loc_4CE8FE
                mov     eax, [ebp+Number]
                movq    xmm0, qword ptr [eax]
                jmp     short loc_4CE905
; ---------------------------------------------------------------------------

loc_4CE8FE:                             ; CODE XREF: Sse2_exp_precise+43?j
                xor     eax, eax
                jmp     loc_4CEAE7
; ---------------------------------------------------------------------------

loc_4CE905:                             ; CODE XREF: Sse2_exp_precise+1E?j
                                        ; Sse2_exp_precise+2C?j ...
                unpcklpd xmm0, xmm0
                pextrw  eax, xmm0, 3
                and     eax, 7FFFh
                mov     edx, 408Fh
                sub     edx, eax
                sub     eax, 3C90h
                or      edx, eax
                cmp     edx, 80000000h
                jnb     loc_4CEAC4
                movupd  xmm1, SSE_Cv4
                movupd  xmm6, SSE_Shifter4
                movupd  xmm2, SSE_ExpVar1
                movupd  xmm3, SSE_ExpVar2
                mulpd   xmm1, xmm0
                addpd   xmm1, xmm6
                movupd  xmm7, xmm1
                subpd   xmm1, xmm6
                mulpd   xmm2, xmm1
                mulpd   xmm3, xmm1
                movupd  xmm5, SSE_ExpVar4
                subpd   xmm0, xmm2
                movd    eax, xmm7
                mov     ecx, eax
                and     ecx, 3Fh
                shl     ecx, 4
                sar     eax, 6
                mov     edx, eax
                subpd   xmm0, xmm3
                movupd  xmm2, ExpTable[ecx]
                movupd  xmm4, SSE_ExpVar3
                mulpd   xmm4, xmm0
                movupd  xmm1, xmm0
                mulpd   xmm0, xmm0
                addpd   xmm5, xmm4
                mulsd   xmm0, xmm0
                addsd   xmm1, xmm2
                unpckhpd xmm2, xmm2
                movdqu  xmm6, xmmword ptr SSE_Exp_MMask0
                pand    xmm7, xmm6
                movdqu  xmm6, xmmword ptr SSE_Exp_Bias0
                paddq   xmm7, xmm6
                psllq   xmm7, 2Eh
                mulpd   xmm0, xmm5
                addsd   xmm1, xmm0
                orpd    xmm2, xmm7
                unpckhpd xmm0, xmm0
                addsd   xmm0, xmm1
                add     edx, 37Eh
                cmp     edx, 77Ch
                ja      loc_4CEA07
                mulsd   xmm0, xmm2
                addsd   xmm0, xmm2
                mov     eax, 1
                jmp     loc_4CEABF
; ---------------------------------------------------------------------------

loc_4CEA07:                             ; CODE XREF: Sse2_exp_precise+13F?j
                fstcw   word ptr [ebp+ControlWord]
                mov     dx, word ptr [ebp+ControlWord]
                or      dx, 300h
                mov     word ptr SSe_Exp_RetVal, dx
                fldcw   word ptr SSe_Exp_RetVal
                mov     edx, eax
                sar     eax, 1
                sub     edx, eax
                movdqu  xmm6, xmmword ptr SSE_Exp_EMask0
                pandn   xmm6, xmm2
                add     eax, 3FFh
                movd    xmm3, eax
                psllq   xmm3, 34h
                orpd    xmm6, xmm3
                add     edx, 3FFh
                movd    xmm4, edx
                psllq   xmm4, 34h
                movlpd  SSe_Exp_RetVal, xmm0
                fld     SSe_Exp_RetVal
                movlpd  SSE_TmpVar2, xmm0
                fld     SSE_TmpVar2
                fmul    st(1), st
                faddp   st(1), st
                movlpd  SSe_Exp_RetVal, xmm4
                fld     SSe_Exp_RetVal
                fmulp   st(1), st
                fstp    SSe_Exp_RetVal
                movlps  xmm0, SSe_Exp_RetVal
                fldcw   word ptr [ebp+ControlWord]
                pextrw  ecx, xmm0, 3
                and     ecx, 7FF0h
                cmp     ecx, 7FF0h
                jb      short loc_4CEAAE
                mov     eax, 0FFFFFFFEh
                jmp     short loc_4CEABF
; ---------------------------------------------------------------------------

loc_4CEAAE:                             ; CODE XREF: Sse2_exp_precise+1F5?j
                cmp     ecx, 0
                jnz     short loc_4CEABA
                mov     eax, 0FFFFFFFFh
                jmp     short loc_4CEABF
; ---------------------------------------------------------------------------

loc_4CEABA:                             ; CODE XREF: Sse2_exp_precise+201?j
                mov     eax, 1

loc_4CEABF:                             ; CODE XREF: Sse2_exp_precise+152?j
                                        ; Sse2_exp_precise+1FC?j ...
                jmp     loc_4CEAE7
; ---------------------------------------------------------------------------

loc_4CEAC4:                             ; CODE XREF: Sse2_exp_precise+77?j
                mov     eax, [ebp+Number]
                and     eax, 7FFFFFFFh
                cmp     eax, 40900000h
                jnb     short loc_4CEAE2
                addsd   xmm0, SSE_OneVal0
                mov     eax, 1
                jmp     short loc_4CEAE7
; ---------------------------------------------------------------------------

loc_4CEAE2:                             ; CODE XREF: Sse2_exp_precise+221?j
                mov     eax, 0FFFFFFFFh

loc_4CEAE7:                             ; CODE XREF: Sse2_exp_precise+50?j
                                        ; Sse2_exp_precise:loc_4CEABF?j ...
                pop     ecx
                pop     edx
                mov     esp, ebp
                pop     ebp
                retn    8
Sse2_exp_precise endp

; #########################################################################

end start

Title: Re: Fast Exp approximation
Post by: jj2007 on August 12, 2020, 04:54:02 AM
It looks like you used the asm output of a compiler. What does...
    push 1
    push 5
    call Sse2_exp_precise

... perform? What are the parameters? What is the expected output? I get 148.4131591 in the low double of xmm0, does that make sense, and how?

What exactly is an "exp function"? I consider this  (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1444)an "exp function":
include \masm32\MasmBasic\MasmBasic.inc
MyDouble double ?
Init
  ExpXY(4, 0.5, MyDouble)
  Inkey Str$("4^0.5=%9f", MyDouble) ; Output: 4^0.5=2.00000000
EndOfCode


P.S., to make your source assemble with UAsm:
movupd  xmm6, OWORD ptr SSE_Shifter4
movupd  xmm2, OWORD ptr ExpTable[ecx]
Title: Re: Fast Exp approximation
Post by: guga on August 12, 2020, 06:33:55 AM
Hi JJ

Yes, when you input 5 as an integer value, the result is 148.4131591....

The parameters are:

1st Parameter = The value to be calculated. It can be a integer, a Float or a address to  Real8  (or qword)
2nd Parameter - The flag to be used on each type of input.   I build a set of equates for that.
SSE_EXP_INT  equ 1
SSE_EXP_FLOAT equ 2
SSE_EXP_REAL8 equ 4
SSE_EXP_QWORD equ 8

                            So, if the input is a integer you must use
invoke Sse2_exp_precise, 5, 1
or
invoke Sse2_exp_precise, 5, SSE_EXP_INT

If the input if a Float (Real4) you use:

MyValue Real4 5.0 ; I believe this is the masm syntax for Float, right ?

invoke Sse2_exp_precise, [MyValue], 2

or
invoke Sse2_exp_precise, [MyValue], SSE_EXP_FLOAT

If the input is a Real8, you must use as a input the address (Offset) of the Value, like this:

MyValue Real8 5.0
invoke Sse2_exp_precise, offset MyValue, 4

or

MyValue Real8 5.0
invoke Sse2_exp_precise, offset MyValue, SSE_EXP_REAL8

And, the same for a Qword (int64). I don´t remember what is the syntax for a Qword in masm. But it is for integers in 64 bits.

It works for either positive or negative values.


Exp is the exponencial function.

exp(5) = 148.4131591025766034211155800405522796234876675938789890467

on mine version it is close (If not, equal)
exp(5) = 148.4131591...

I was trying to make it work with masmbasic, but i don´t know how to setup the output for ir display the result 148.41....

I tried this:


  push esi
  push loops-1
  NanoTimer()
push 1
push 5
call Sse2_exp_precise
; cvtsd2ss eax, xmm0 ; convert to int
cvtsd2ss xmm0, xmm0

; cvtsd2si  eax, xmm0
; fstp tmp8

Print Str$("## SSe_2_exp_Precise result Exp(5) = %f", eax),  Str$("<>%i ##\n\n", esi)



But...got nowehere. I only got the amount of time the function took to work. (Attached file). It assembled on masmbasic and worked, but i don´t know how to set a variable in masm basic to see the result.
Title: Re: Fast Exp approximation
Post by: jj2007 on August 12, 2020, 11:50:31 AM
Here it is - your source, formatted and ready to launch, Str$(f:xmm0) did the job ;-)

  REPEAT 3
  NanoTimer()
  push loops-1
  .Repeat
push 1 ;
push 5
call Sse2_exp_precise ; invoke Sse2_exp_precise, 5, 1
dec stack
  .Until Sign?
  pop edx
  Print Str$("%i Mio Sse2_exp_precise iterations took ", loops/1000000), NanoTimer$(), Str$(" - result Exp(5) = %9f\n", f:xmm0)
  NanoTimer()
  push loops-1
  .Repeat
ExpE(5, tmp8)
dec stack
  .Until Sign?
  pop edx
  Print Str$("%i Mio MasmBasic ExpE() iterations took ", loops/1000000), NanoTimer$(), Str$(" - result Exp(5) = %9f\n\n", tmp8)
  ENDM


You made me discover a bug in MasmBasic, therefore it will build correctly only with MB version 12 August 2020 (http://masm32.com/board/index.php?topic=94.0) :tongue:

Actually, it was not really a bug but...

MbExp PROTO
if 0
       ffree st(7)             ; MbExp[-12]
        fldl2e
        jmp @F
Mb10X:         ffree st(7)     ; MbExp[-6]
        fldl2t
@@:     fmul
endif
align 4
MbExp proc
        sub esp, 3*DWORD
        r10 equ stack
        mov eax, 3fffh          ; credits to bitRake (http://www.asmcommunity.net/forums/topic/?id=2979#post-30586), qWord & Agner Fog (http://masm32.com/board/index.php?topic=1783.msg18292#msg18292)
        xor edx, edx


What happened here is that a few years ago, when looking at this, I didn't quite understand what this code did (April 2013 - oooold (http://masm32.com/board/index.php?topic=94.msg18468#msg18468)), so I commented it out with the if 0, and from then on a call MbExp[-12] ended up in No Man's Land. A clear case of not enough comments :cool:
Title: Re: Fast Exp approximation
Post by: guga on August 12, 2020, 03:05:15 PM
Timmings Result


JJ, can you please see if i ported it correctly this time?  I was thinking a bit weird the result be that slow and gave a try with another timming example you made using Dave routine called "f8Exp". I replaced his version with mine to test the speed of the algo

Can you please see if the porting is ok, and also try to force the result to export from xmm0 to MyReal8 to we also check the resultant value and the timmings ?

Here, the timmings are now Ok. They fits to the results i´ve got earlier.


AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

2076    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11193   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25992   cycles for 100 * pow (CRT, 2.7182818^5)

2024    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11148   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
29457   cycles for 100 * pow (CRT, 2.7182818^5)

2039    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11639   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25994   cycles for 100 * pow (CRT, 2.7182818^5)

2065    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11204   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27472   cycles for 100 * pow (CRT, 2.7182818^5)

2033    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
12207   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25956   cycles for 100 * pow (CRT, 2.7182818^5)

148.413159102577 for Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)

--- ok ---



If the test is ok, you guys can use the function if needed, and also change it to it be more readable using the correct masm macros etc. I finished a similar routine for calculating a fast Logarithm and fast pow of a number.I should post it soon if the tests of this sse2 exp precise are ok.  :bgrin:
Title: Re: Fast Exp approximation
Post by: jj2007 on August 12, 2020, 06:09:18 PM
Quote from: guga on August 12, 2020, 03:05:15 PMJJ, can you please see if i ported it correctly this time?

I got 30 times "error A2071:initializer magnitude too large for specified size" :sad:

Then I switched to UAsm64 (see OPT_Assembler at the end), and everything was ok :biggrin: :thumbsup:

@Nidud: with AsmC /Znk I get that error A2071 :sad:

Congrats, Guga now it is definitely very fast - what is so different now?

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2669    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
10020   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
20399   cycles for 100 * pow (CRT, 2.7182818^5)


Thanks for the offer to use your code. I will do so if I need a really fast exponential function. For the library, I hesitate, because I am already fighting with the bloat ;-)

1788    bytes for Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
29      bytes for ExpXY (MasmBasic, 2.7182818^5)
39      bytes for pow (CRT, 2.7182818^5)
Title: Re: Fast Exp approximation
Post by: guga on August 12, 2020, 07:15:08 PM
Quote from: jj2007 on August 12, 2020, 06:09:18 PM
Quote from: guga on August 12, 2020, 03:05:15 PMJJ, can you please see if i ported it correctly this time?

I got 30 times "error A2071:initializer magnitude too large for specified size" :sad:

Then I switched to UAsm64 (see OPT_Assembler at the end), and everything was ok :biggrin: :thumbsup:

@Nidud: with AsmC /Znk I get that error A2071 :sad:

Congrats, Guga now it is definitely very fast - what is so different now?

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2669    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
10020   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
20399   cycles for 100 * pow (CRT, 2.7182818^5)


Thanks for the offer to use your code. I will do so if I need a really fast exponential function. For the library, I hesitate, because I am already fighting with the bloat ;-)

1788    bytes for Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
29      bytes for ExpXY (MasmBasic, 2.7182818^5)
39      bytes for pow (CRT, 2.7182818^5)


I simply updated masm basic and used the same routine you did on another post for dedndave (TimingsPow.asc). I assembled it with masm and masmbasic.


About the difference of speed, i didn´t changed the previous code, i simply used the correct base and set the parameters according. This time i made a test using Real8 values because the other functions were also being tested with Real 8.

The  parameters for the function was writen as:

   push 4   ; The value "4" we use to compute Real8 as i explained on the previous post.
     push offset MyExpo ; Pointer to Real8 or to a Qword only when we are using 64 bits values. For 32 Bits values (dword, Float etc), we use the value directly because int or Float are not pointers. (So, without the "offset" thing in masm)
   call Sse2_exp_precise

The algorithm works only for the exponential of a number (not general power of a number). So, i mainly used the Euler number as a base like this:

MyBaseA   REAL8 2.71828182845904523536028747135
MyBaseB   REAL8 2.71828182845904523536028747135
MyBaseC   REAL8 2.71828182845904523536028747135

https://en.wikipedia.org/wiki/Euler%27s_formula

Your others functions on your example were calculating "pow", so i mainly applied all of them, to use the same base as my algo to compare the results.

I´ll later post other function that calculates a general pow of a number that seems to be as fast as this one and we can then test the results  :thumbsup:

I´m pretty sure there´s more room for optimization and also avoiding it to use all xmm register. I would prefer the algo was using only xmm0, xmm1 or xmm2, but i couldn´t suceed to limit the numbers of SSE registers being used yet.

The function was originally inside ucrtbase,dll in windows10 . Its a undocumented function, but the original code is a true mess and a bit slower then mine. So, i adapted it to work as expected, but instead using only xmm0 as input, i enabled the possibility to the user inputs whatever number format he wants (integer, floats, doubles etc) and also returns in eax error cases. MY main concern is about precision.The algo has a better precision then other libraries, but i don´t know in how many digits yet it starts loosing precision.

Currently, i´m  inserting the function to be used as dll and making a small guide for it and allows teh user to also export the result on whatever way he wants (If it wouldn´t affect the general speed, of course)



Btw...if you want to play and have fun with speed, you may give a try on the other piece of code i provided as an example in the 1st post. Just use xmm0 as input and Voilà :mrgreen: :mrgreen: :mrgreen:. This piece of code below is speed as possible (even faster then mine), but it completely lacks precision.


M_LN2_D Real8 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184,
                Real8 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184 ; 0.5/(ln(2))
SSE_FastExp_Var1 Real8 3070.95411, Real8 3070.95411 ; 3 * 1024.0 - 1.05

"Real8" is the proper masm syntax to define a Double (64 Bits) floating values, right ???

(...)
    movups xmm2, X$M_LN2_D ; X$ is Rosasm syntax for xmmword.
    movups xmm1, X$SSE_FastExp_Var1
    mulpd xmm0, xmm2 ; The input number must be in xmm0
    movups xmm2, xmm1
    addpd xmm2, xmm0
    psllq XMM2, 11

    ; now we subtract
    subpd xmm1, xmm0
    psllq XMM1, 11

    divpd xmm2, xmm1
    movups xmm0, xmm2
Title: Re: Fast Exp approximation
Post by: jj2007 on August 12, 2020, 07:56:23 PM
Very nice work, Guga :thumbsup:

I've put together a little testbed for examining the pattern of the exponential function, see attachment. My suspicion is that it should be possible to calculate the mantissa from a table, and to apply the exponent afterwards. Maybe following the Fast Sinus() and Cosinus() algos (http://masm32.com/board/index.php?topic=4580.0) path :cool:

But math is my weak point. What do you see there?
Title: Re: Fast Exp approximation
Post by: guga on August 12, 2020, 08:45:29 PM
Quote from: jj2007 on August 12, 2020, 07:56:23 PM
Very nice work, Guga :thumbsup:

I've put together a little testbed for examining the pattern of the exponential function, see attachment. My suspicion is that it should be possible to calculate the mantissa from a table, and to apply the exponent afterwards. Maybe following the Fast Sinus() and Cosinus() algos (http://masm32.com/board/index.php?topic=4580.0) path :cool:

But math is my weak point. What do you see there?

Hi JJ

I see this:
(https://i.ibb.co/3kSxshY/Image1.png) (https://ibb.co/WsBr3Kp)

Btw..how you managed to create this curve window ? I like that a lot :thumbsup: :thumbsup: :thumbsup: :thumbsup:

About the fastsin, cos algos in the post, i´ll take a look today.  Calculating the mantissa from a table, should indeed be the way to go. But, i have no ide how to do it. My knowledge in math is also a bit limited. I´m not used to those complex math symbols anymore. Some i have to push from memory when i tried the engineering university when i was a young (Before i entered to law school, i tried engineering, but i quit. Did mainly 3 years or something.), but...it´s hard because i´m talking of something i was studying more then 30 years ago :greensml: :greensml: :greensml: :greensml: :greensml:

Btw...The precision starts loosing after the 6th or 7th digit i suppose. In Rosasm i can see untill 14 digits so it´s easier to check when they are different, but on your test, it have less digits to compare
Title: Re: Fast Exp approximation
Post by: jj2007 on August 12, 2020, 09:04:50 PM
Quote from: guga on August 12, 2020, 08:45:29 PMI see this:

I actually meant "what kind of pattern do you see, and how could we use it to fumble mantissa and exponent"

QuoteBtw..how you managed to create this curve window ? I like that a lot :thumbsup: :thumbsup: :thumbsup: :thumbsup:

See ArrayPlot in the help file. All you need is an array:

Event Paint
  ArrayPlot MyData(), RgbCol(200, 255, 255, 0), lines=4
  ArrayPlot exit, "Exponential function"
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 12, 2020, 09:05:31 PM
Hi Guga,

You said: MY main concern is about precision.
So, a very fast Chebyshev polynomial approximation is out of the question?

For what purpose do you need the Exp function?
Title: Re: Fast Exp approximation
Post by: guga on August 12, 2020, 09:40:55 PM
Quote from: Siekmanski on August 12, 2020, 09:05:31 PM
Hi Guga,

You said: MY main concern is about precision.
So, a very fast Chebyshev polynomial approximation is out of the question?

For what purpose do you need the Exp function?

Hi Marinus

It depends of the level of precision of the Chebyshev polynomial. I think it´s usefull for what we are doing. Even considering some loss of precision, it seems to me, it won´t affect the final result.

The exp functon i was using to test that W Lambert function i told on  the other post. Even if we won´t use it for the watermark r emover, we can use for other puposes in other image filters that we could make. I gave a try on that laplace 2D algorithm that used this exp and pow functions to compute the W function. But, after you explained, i understood better this laplace thing, but i wwant to give a try calculating the sigma as a standard deviation of the whole image and see if we will need this Lambert algoithm or not.

But, even if we wouldn´t need it we can use faster pow and exp to build other algorithms such as a faster (and more accurate) CieLab for example or the other algorithm i tried last year (HSM or something)
Title: Re: Fast Exp approximation
Post by: guga on August 12, 2020, 10:04:51 PM
Quote from: jj2007 on August 12, 2020, 09:04:50 PM
Quote from: guga on August 12, 2020, 08:45:29 PMI see this:

I actually meant "what kind of pattern do you see, and how could we use it to fumble mantissa and exponent"

QuoteBtw..how you managed to create this curve window ? I like that a lot :thumbsup: :thumbsup: :thumbsup: :thumbsup:

See ArrayPlot in the help file. All you need is an array:

Event Paint
  ArrayPlot MyData(), RgbCol(200, 255, 255, 0), lines=4
  ArrayPlot exit, "Exponential function"


Hi JJ

You mean, the result ?

I took a look at the result and compared with the same value as in wolframalpha and...for my surprise, my algo seems to be 100% accurate (At least until the 13th digit) ! :dazzled: :dazzled: :dazzled: :dazzled:


These are the results i´ve got !

When using the input as Real8 i see this number:

exp(5) = 148.4131591025766

and in wolframalpha, the result is:

exp(5) = 148.4131591025766034211155800405522796234876675938789890467...

exp(-5) = 6.737946999085467e-3

in wolframalpha results in:

exp(-5) = 6.737946999085467096636048423148424248849585027355085430e-3

When i use the input as Real4(Float), i see this:

exp(5) = 148.4131591025766
exp(-5) = 6.737946999085467e-3

When i use the input as int, i see this:
exp(5) = 148.4131591025766
exp(-5) = 6.737946999085467e-3



This really surprises me, because on my initial tests, the original version from windows10 had a lack of precision after the 6th or 7th digit, but, somehow i managed to fix this damn algo, regardless the input format. So, even a int or Real4 value will result on a precise value without loss :greensml: :greensml: :greensml: :greensml: :greensml:
Title: Re: Fast Exp approximation
Post by: nidud on August 13, 2020, 01:29:54 AM
deleted
Title: Re: Fast Exp approximation
Post by: jj2007 on August 13, 2020, 03:33:04 AM
Quote from: nidud on August 13, 2020, 01:29:54 AM
@Nidud: with AsmC /Znk I get that error A2071 :sad:
Quote
Well, it is an erroneous number according to the specification used:

    dq 2.42294657314453310e-310

...
However, ML and ML64 version 14 do not produce this error.

Indeed. And Olly reports 2.4229465731445330820e-310 in ST(0) after a fld with that number.

include \masm32\MasmBasic\MasmBasic.inc
ExpTable dq 2.42294657314453310e-310 ; invalid number for AsmC
  Init
  fld ExpTable
  fld FP10(1.0e310)
  fmul
  PrintLine "expected:  2.42294657314453310"
  PrintLine Str$("obtained:  %Jf", ST(0)v)
EndOfCode


Output:
expected:  2.42294657314453310
obtained:  2.422946573144533082


Using 2.4e-320, a certain loss of precision creeps in:
expected:  2.42294657314453310
obtained:  2.422897927205473053
Title: Re: Fast Exp approximation
Post by: HSE on August 13, 2020, 05:12:28 AM
Quote from: nidud on August 13, 2020, 01:29:54 AM
DBL_MAX equ 1.7976931348623158e+308
DBL_MIN equ 2.2250738585072014e-308

That is Normalized range.

Intel specification say that only 80 bit denormalized values can be loaded without to raise an exception...  I don't know, here I have an AMD  :biggrin:
Title: Re: Fast Exp approximation
Post by: jj2007 on August 13, 2020, 05:24:35 AM
Yep, but the hardware allows much more: at e-320 you still have over 4 digits of precision, more than enough for most purposes.
Title: Re: Fast Exp approximation
Post by: nidud on August 13, 2020, 06:23:14 AM
deleted
Title: Re: Fast Exp approximation
Post by: jj2007 on August 13, 2020, 08:11:51 AM
http://www.website.masmforum.com/tutorials/fptute/fpuchap4.htm
QuoteIf the source is a denormalized number, a Denormal exception will be detected, setting the related flag in the Status Word. The value will still be loaded and normalized if possible.

This is exactly what you can observe with Olly when loading from somevar REAL8 2.42294657314453310e-320 - the D flag is set. The question is how to deal with it. IMHO UAsm does it right: accept denormal values. If a programmer insists to work in this exotic range, he should know how to handle the loss of precision. A warning would be ok, though.
Title: Re: Fast Exp approximation
Post by: HSE on August 13, 2020, 08:24:47 AM
 :thumbsup:
Title: Re: Fast Exp approximation
Post by: nidud on August 13, 2020, 09:59:25 AM
deleted
Title: Re: Fast Exp approximation
Post by: jj2007 on August 13, 2020, 10:37:20 AM
Quote from: nidud on August 13, 2020, 09:59:25 AMI'm a bitu sceptical about this approach thought

The hardware can handle the value, but AsmC refuses to accept it? Let's use UAsm then.

Seriously: if the risk of setting the Underflow flag bothers you, issue a warning. The case is so exotic that I wouldn't worry at all.
Title: Re: Fast Exp approximation
Post by: nidud on August 13, 2020, 11:43:27 AM
deleted
Title: Re: Fast Exp approximation
Post by: guga on August 13, 2020, 12:40:45 PM
Quote from: HSE on August 13, 2020, 05:12:28 AM
Quote from: nidud on August 13, 2020, 01:29:54 AM
DBL_MAX equ 1.7976931348623158e+308
DBL_MIN equ 2.2250738585072014e-308

That is Normalized range.

Intel specification say that only 80 bit denormalized values can be loaded without to raise an exception...  I don't know, here I have an AMD  :biggrin:

Tks you so much for the equates HSE
Title: Re: Fast Exp approximation
Post by: jj2007 on August 13, 2020, 07:20:29 PM
Quote from: nidud on August 13, 2020, 11:43:27 AMThe assembler don't use the FPU (or SSE) so it doesn't set the underflow flag but rather the function that convert the string sets errno to ERANGE. I assume the changes in Masm (and Uasm) has to do with external library functions rather than any deliberate changes to the assembler.

    errno = 0; /* v2.11: init errno; errno is set on over- and under-flow */
    double_value = strtod( inp, NULL );

Of course it's not a runtime check. When you try to assemble somebyte db 256, the assembler checks the bounds and throws an error. Same for somedouble REAL8 1.0e-400 - that's an error, and the assembler knows it. But there is a grey area in this case due to the capacity of the FPU to load "denormalised" values beyond 2.22e-308. They behave like legit numbers, so those who need that (a very exotic case, but we stumbled over it in this thread) should be able to use this feature.

Therefore my proposal: issue a warning for the grey area. Maybe only once, I hate long listings of identical warnings.

At present, you get fatal error A1012: error count exceeds 100; stopping assembly, and that clearly is not necessary, given that the code runs just fine when assembled with UAasm.

Sheer luxury would be a one-liner:

W1000: 112 denormalised numbers between lines 14 and 39 :cool:
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 14, 2020, 01:03:38 AM
Hi guga,

Very fast real4 SSE2 Exponent routine.
Up to 4 exponents at once.
Precision: 7 digits

.const
Chebylog2E  real4 4 dup (1.44269504089)

; 5th Degree Chebyshev ( Remez Algorithm ) Polynomials
Cheby5Exp0  real4 4 dup (0.0018775767)
Cheby5Exp1  real4 4 dup (0.0089893397)
Cheby5Exp2  real4 4 dup (0.055826318)
Cheby5Exp3  real4 4 dup (0.24015361)
Cheby5Exp4  real4 4 dup (0.69315308)
Cheby5Exp5  real4 4 dup (0.99999994)

.code
align 16
SSE2_Exp:  ; in: xmm0, out: xmm0
    movaps      xmm2,oword ptr Chebylog2E
    mulps       xmm2,xmm0
    psrld       xmm0,31
    cvttps2dq   xmm1,xmm2
    psubd       xmm1,xmm0
    movdqa      xmm0,xmm1
    cvtdq2ps    xmm1,xmm1
    subps       xmm2,xmm1
    movaps      xmm1,oword ptr Cheby5Exp0
    pslld       xmm0,23
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby5Exp1
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby5Exp2
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby5Exp3
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby5Exp4
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby5Exp5
    paddd       xmm0,xmm1
    ret
Title: Re: Fast Exp approximation
Post by: nidud on August 14, 2020, 01:06:50 AM
deleted
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 14, 2020, 01:14:10 AM
SSE4_1 version, faster than the SSE2 version.

align 16
SSE41_Exp: ; in: xmm0, out: xmm0
    mulps       xmm0,oword ptr Chebylog2E
    roundps     xmm1,xmm0,1
    subps       xmm0,xmm1
    cvtps2dq    xmm2,xmm1
    pslld       xmm2,23
    movaps      xmm1,xmm0
    mulps       xmm0,oword ptr Cheby5Exp0
    addps       xmm0,oword ptr Cheby5Exp1
    mulps       xmm0,xmm1
    addps       xmm0,oword ptr Cheby5Exp2
    mulps       xmm0,xmm1
    addps       xmm0,oword ptr Cheby5Exp3
    mulps       xmm0,xmm1
    addps       xmm0,oword ptr Cheby5Exp4
    mulps       xmm0,xmm1
    addps       xmm0,oword ptr Cheby5Exp5
    paddd       xmm0,xmm2
    ret
Title: Re: Fast Exp approximation
Post by: jj2007 on August 14, 2020, 02:20:26 AM
@nidud: It seems you just want to argue. Frustrated? Can we help you?

include \masm32\MasmBasic\MasmBasic.inc
NotSoSmall REAL8 1.234567890123456e-308 ; legal for all assemblers, even AsmC
VerySmall REAL8 1.234567890123456e-322 ; legal for UAsm and MASM 14+
; TooSmall REAL8 1.234567890123456e-324 ; illegal even for UAsm
Result REAL8 ?
  Init
  fld NotSoSmall
  fmul FP8(1.0e154)
  fmul FP8(1.0e154)
  fstp Result
  Print Str$("NotSoSmall*1e308=\t%Jf\n", Result)
  fld VerySmall
  fmul FP8(1.0e161)
  fmul FP8(1.0e161)
  fstp Result
  Inkey Str$("VerySmall*1e322=\t%Jf\n", Result)
EndOfCode


Output:
NotSoSmall*1e308=       1.234567890123456246
VerySmall*1e322=        1.235164114603116481
Title: Re: Fast Exp approximation
Post by: guga on August 14, 2020, 02:51:24 AM
Quote from: Siekmanski on August 14, 2020, 01:03:38 AM
Hi guga,

Very fast real4 SSE2 Exponent routine.
Up to 4 exponents at once.
Precision: 7 digits

.const
Chebylog2E  real4 4 dup (1.44269504089)

; 5th Degree Chebyshev ( Remez Algorithm ) Polynomials
Cheby5Exp0  real4 4 dup (0.0018775767)
Cheby5Exp1  real4 4 dup (0.0089893397)
Cheby5Exp2  real4 4 dup (0.055826318)
Cheby5Exp3  real4 4 dup (0.24015361)
Cheby5Exp4  real4 4 dup (0.69315308)
Cheby5Exp5  real4 4 dup (0.99999994)

.code
align 16
SSE2_Exp:  ; in: xmm0, out: xmm0
    movaps      xmm2,oword ptr Chebylog2E
    mulps       xmm2,xmm0
    psrld       xmm0,31
    cvttps2dq   xmm1,xmm2
    psubd       xmm1,xmm0
    movdqa      xmm0,xmm1
    cvtdq2ps    xmm1,xmm1
    subps       xmm2,xmm1
    movaps      xmm1,oword ptr Cheby5Exp0
    pslld       xmm0,23
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby5Exp1
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby5Exp2
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby5Exp3
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby5Exp4
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby5Exp5
    paddd       xmm0,xmm1
    ret



Great ! A couple of questions:


1 - How did you got those numbers of ChebyXXX ? How to calculate this cheby values ?

2 - Can we improve the precision to the 13Th digit ? What is the cost in terms of speed ?

3 - what if we extend it to a 10th Degree Chebyshev ( Remez Algorithm ) Polynomials ? Is it possible to grant more accuracy ?

Title: Re: Fast Exp approximation
Post by: Siekmanski on August 14, 2020, 04:30:26 AM
Hi guga,

I had the calculated polynomials already on my computer for a few years.
You can create them with the program called LolRemez.
It's open source.

https://github.com/samhocevar/lolremez

Maybe somebody here on the forum would be so kind to compile the program for us in Visual Studio?  :eusa_pray:

https://github.com/samhocevar/lolremez/releases/download/v0.6/lolremez-0.6.zip
Title: Re: Fast Exp approximation
Post by: guga on August 14, 2020, 05:36:21 AM
I guess i did it  :thumbsup:


See if it is working as expected, please. I had to install github on the desktop to make this thing compile :mrgreen: :mrgreen: :mrgreen:
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 14, 2020, 06:16:40 AM
 :thumbsup:

lolremez -d 10 -r 0:1 "2**x" "2**x"


// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
// p(x)=(((((((((9.9522144894077596e-9*x+9.4609455637620191e-8)*x+1.331271998
894e-6)*x+1.5244659656760603e-5)*x+1.5403957490414841e-4)*x+1.333354373097474
3)*x+9.6181294091749148e-3)*x+5.5504108628244985e-2)*x+2.402265069613608e-1)*
.9314718055989127e-1)*x+1.0000000000000002
double f(double x)
{
    double u = 9.9522144894077596e-9;
    u = u * x + 9.4609455637620191e-8;
    u = u * x + 1.331271998884894e-6;
    u = u * x + 1.5244659656760603e-5;
    u = u * x + 1.5403957490414841e-4;
    u = u * x + 1.3333543730974749e-3;
    u = u * x + 9.6181294091749148e-3;
    u = u * x + 5.5504108628244985e-2;
    u = u * x + 2.402265069613608e-1;
    u = u * x + 6.9314718055989127e-1;
    return u * x + 1.0000000000000002;
}
Title: Re: Fast Exp approximation
Post by: nidud on August 14, 2020, 06:32:37 AM
deleted
Title: Re: Fast Exp approximation
Post by: guga on August 14, 2020, 06:42:51 AM
Quote from: Siekmanski on August 14, 2020, 06:16:40 AM
:thumbsup:

lolremez -d 10 -r 0:1 "2**x" "2**x"


// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
// p(x)=(((((((((9.9522144894077596e-9*x+9.4609455637620191e-8)*x+1.331271998
894e-6)*x+1.5244659656760603e-5)*x+1.5403957490414841e-4)*x+1.333354373097474
3)*x+9.6181294091749148e-3)*x+5.5504108628244985e-2)*x+2.402265069613608e-1)*
.9314718055989127e-1)*x+1.0000000000000002
double f(double x)
{
    double u = 9.9522144894077596e-9;
    u = u * x + 9.4609455637620191e-8;
    u = u * x + 1.331271998884894e-6;
    u = u * x + 1.5244659656760603e-5;
    u = u * x + 1.5403957490414841e-4;
    u = u * x + 1.3333543730974749e-3;
    u = u * x + 9.6181294091749148e-3;
    u = u * x + 5.5504108628244985e-2;
    u = u * x + 2.402265069613608e-1;
    u = u * x + 6.9314718055989127e-1;
    return u * x + 1.0000000000000002;
}

Hi Marinus

Thanks... :thumbsup: :thumbsup: :thumbsup:

This is what i would ask yo8u When do you know which range to use ?

I mean, on https://github.com/samhocevar/lolremez/wiki/Tutorial-1-of-5%3A-exp%28x%29-the-quick-way  It shows a example of a exp(x) function with a range of -1 to 1. Yours is a range of 0 to 1. So, i don´t get it. We need to use the value of x always normalized ???

Because even when we insert a range of 0 to 1 with exp(x), the values are different from the ones you posted.

Btw... why did you inputed "2**x" "2**x" to calculate exp(x) ?
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 14, 2020, 07:08:51 AM
2**x is just the notation for the Exponent function, the second one is the weight function.
Just as you, I have to find my way in this LolRemez program.  :thumbsup:

The example -1 to 1 is often used in audio.
You stay always in that range, so you don't have compensation code for numbers outside that range.

I just tried to get the polynomials I used in my SSE2 example.
They are almost the same.
I will use the new ones. ( should be more precise since LolRemez has evolved to version 0.6  :cool:)

lolremez -d 5 -r 0:1 "2**x" "2**x"

// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 5.
// p(x)=((((1.8775766751914795e-3*x+8.9893400904946651e-3)*x+5.5826318053295672e
-2)*x+2.401536170443754e-1)*x+6.9315307320016896e-1)*x+9.9999992506352618e-1
double f(double x)
{
    double u = 1.8775766751914795e-3;
    u = u * x + 8.9893400904946651e-3;
    u = u * x + 5.5826318053295672e-2;
    u = u * x + 2.401536170443754e-1;
    u = u * x + 6.9315307320016896e-1;
    return u * x + 9.9999992506352618e-1;
Title: Re: Fast Exp approximation
Post by: jj2007 on August 14, 2020, 08:23:28 AM
  fld FP4(0.5) ; x
  fld st
  fld st
  fld st
  fld st
  fmul FP8(1.8775766751914795e-3)
  fadd FP8(8.9893400904946651e-3)
  fmul
  fadd FP8(5.5826318053295672e-2)
  fmul
  fadd FP8(2.401536170443754e-1)
  fmul
  fadd FP8(6.9315307320016896e-1)
  fmul
  fadd FP8(9.9999992506352618e-1) ; result on FPU
  PrintLine Str$("2^0.5=\t  %Jf", ST(0)v), CrLf$, "expected: 1.414213562373095048"


Output:
2^0.5=    1.414213663708122130
expected: 1.414213562373095048
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 14, 2020, 08:32:19 AM
Guga,
Thank you for compiling this piece of GOLD for me in Visual Studio.
LolRemez is really a very valuable program.
It opens doors to get very fast trigonometric functions with polynomials exactly the way I want them.

Hi Jochen,
This is cool stuff or not?
Now we can adjust the precision to real4 or real8 just by finding the correct degree of polynomials.
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 14, 2020, 08:48:42 AM
Jochen,

You calculated Exp2

Exp2(0.5) is equal to sqrt(2.0)
It calculates the value of 2 raised to the power of x

Here is the FPU function for Exp
It calculates the value of e (the base of natural logarithms) raised to the power of x

FPU_Exp:
; st(0) == x
    fldl2e
    fmulp       st(1), st(0)
    fld         st(0)
    frndint
    fsub        st(1), st(0)
    fxch
    f2xm1
    fld1
    faddp
    fscale
    fstp        st(1)
    ret


result: 1.648721270700 ( for 0.5 )
Title: Re: Fast Exp approximation
Post by: jj2007 on August 14, 2020, 09:30:31 AM
Quote from: Siekmanski on August 14, 2020, 08:32:19 AM
Hi Jochen,
This is cool stuff or not?

Yessss :thumbsup:

Quote from: Siekmanski on August 14, 2020, 08:48:42 AM
You calculated Exp2

Exp2(0.5) is equal to sqrt(2.0)

Indeed. Normally I use Print Str$("2^0.5=%Jf\n", Exp2 (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1193)(0.5)v). Output: 2^0.5=1.414213562373095049  :smiley:
Title: Re: Fast Exp approximation
Post by: guga on August 14, 2020, 02:18:53 PM
Quote from: Siekmanski on August 14, 2020, 07:08:51 AM
2**x is just the notation for the Exponent function, the second one is the weight function.
Just as you, I have to find my way in this LolRemez program.  :thumbsup:

The example -1 to 1 is often used in audio.
You stay always in that range, so you don't have compensation code for numbers outside that range.

I just tried to get the polynomials I used in my SSE2 example.
They are almost the same.
I will use the new ones. ( should be more precise since LolRemez has evolved to version 0.6  :cool:)

lolremez -d 5 -r 0:1 "2**x" "2**x"

// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 5.
// p(x)=((((1.8775766751914795e-3*x+8.9893400904946651e-3)*x+5.5826318053295672e
-2)*x+2.401536170443754e-1)*x+6.9315307320016896e-1)*x+9.9999992506352618e-1
double f(double x)
{
    double u = 1.8775766751914795e-3;
    u = u * x + 8.9893400904946651e-3;
    u = u * x + 5.5826318053295672e-2;
    u = u * x + 2.401536170443754e-1;
    u = u * x + 6.9315307320016896e-1;
    return u * x + 9.9999992506352618e-1;


Hi Marinus. You´re welcome  :thumbsup: :thumbsup: :thumbsup:

Btw,, i still don´t get this app to work properly.

I tried the 10th term and did this:
lolremez -d 10 -r 0:1 "2**x" "2**x"

It created this polynomials


// p(x)=(((((((((9.9522144894077596e-9*x+9.4609455637620191e-8)*x+1.331271998884894e-6)*x+1.5244659656760603e-5)*x+1.5403957490414841e-4)*x+1.3333543730974749e-3)*x+9.6181294091749148e-3)*x+5.5504108628244985e-2)*x+2.402265069613608e-1)*x+6.9314718055989127e-1)*x+1.0000000000000002
double f(double x)
{
    double u = 9.9522144894077596e-9;
    u = u * x + 9.4609455637620191e-8;
    u = u * x + 1.331271998884894e-6;
    u = u * x + 1.5244659656760603e-5;
    u = u * x + 1.5403957490414841e-4;
    u = u * x + 1.3333543730974749e-3;
    u = u * x + 9.6181294091749148e-3;
    u = u * x + 5.5504108628244985e-2;
    u = u * x + 2.402265069613608e-1;
    u = u * x + 6.9314718055989127e-1;
    return u * x + 1.0000000000000002;
}




But, when i tried to put them onto a function, it failed miserably. I don´t understand  what i´m doing wrong.   All i did was:


[<16 Cheby10Exp0: R$ 9.9522144894077596e-9, R$ 9.9522144894077596e-9]
[<16 Cheby10Exp1: R$ 9.4609455637620191e-8, R$ 9.4609455637620191e-8]
[<16 Cheby10Exp2: R$ 1.331271998884894e-6, R$ 1.331271998884894e-6]
[<16 Cheby10Exp3: R$ 1.5244659656760603e-5, R$ 1.5244659656760603e-5]
[<16 Cheby10Exp4: R$ 1.5403957490414841e-4, R$ 1.5403957490414841e-4]
[<16 Cheby10Exp5: R$ 1.3333543730974749e-3, R$ 1.3333543730974749e-3]
[<16 Cheby10Exp6: R$ 9.6181294091749148e-3, R$ 9.6181294091749148e-3]
[<16 Cheby10Exp7: R$ 5.5504108628244985e-2, R$ 5.5504108628244985e-2]
[<16 Cheby10Exp8: R$ 2.402265069613608e-1, R$ 2.402265069613608e-1]
[<16 Cheby10Exp9: R$ 6.9314718055989127e-1, R$ 6.9314718055989127e-1]
[<16 Cheby10Exp10: R$ 1.0000000000000002, R$ 1.0000000000000002]


Proc SSE2_Exp2::
    Arguments @Number, @Flag

    mov eax D@Flag
    Test_if eax SSE_EXP_INT
        cvtsi2ss xmm1 D@Number ; converts a signed integer to double
    Test_Else_if eax SSE_EXP_FLOAT
        ;movss xmm1 D@Number ; converts a single precision float to double
        cvtss2sd xmm1 X@Number ; converts a single precision float to double
    Test_Else_if eax SSE_EXP_REAL8
        mov eax D@Number | movss XMM1 X$eax
    Test_Else_if eax SSE_EXP_QWORD
        mov eax D@Number | movq XMM1 X$eax
    Test_Else
        xor eax eax | ExitP ; return 0 Invalid parameter
    Test_End



    movupd xmm0 X$Cheby10Exp0
    mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp1
    mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp2
    mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp3
    mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp4
    mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp5
    mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp6
    mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp7
    mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp8
    mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp9
    mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp10

EndP


It was supposed to work, because all the values and way to calculate as equal to the C version, right ?

I tried calculating exp of 5, but instead resulting in:

148.41315910257660342111558004055227962348766759387898904675284511

It returned:
31.987334657819531
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 14, 2020, 02:51:54 PM
 :biggrin:

you created an Exp2 function.
Exp function calculates the value of e (the base of natural logarithms) raised to the power of x.
See the version I posted before.
I have to go to bed now, tomorrow I'll show you how to implement the 10th degree polynomials.
Title: Re: Fast Exp approximation
Post by: guga on August 14, 2020, 03:40:49 PM
Thanks to dodicat (From Freebasic forum), we also have a a FreeBasic conversion of this LOlRemex app.

Can someone, please port it to asm ?

"Compile the file poly.bas to .exe
Click it to get the command prompt.
It runs thus:
At the prompt something like:

poly "-6,8,cos(x)*sin(x)"
Remember to use the double quotes as above!
meaning from x = -6 to 8 the function cos(x)*sin(x) is approximated.
Press the spacebar to build up the approximation until you are happy that the curves are similar.
The standard way of showing the approximation coefficients is shown on the console, but you have the option (upon exit by pressing <esc>), to save to a .bas file, the nested method, which is much faster arithmetically for the computer."

Reference: https://www.freebasic.net/forum/viewtopic.php?t=25897
Title: Re: Fast Exp approximation
Post by: jj2007 on August 14, 2020, 06:58:05 PM
With your suggested input lolremez.exe poly "-6,8,cos(x)*sin(x)" it says input error.
With lolremez.exe "poly -6,8,cos(x)*sin(x)" it says EXPECTED (, got '-' but proceeds afterwards with nice graphs.
Title: Re: Fast Exp approximation
Post by: guga on August 14, 2020, 09:22:32 PM
Quote from: jj2007 on August 14, 2020, 06:58:05 PM
With your suggested input lolremez.exe poly "-6,8,cos(x)*sin(x)" it says input error.
With lolremez.exe "poly -6,8,cos(x)*sin(x)" it says EXPECTED (, got '-' but proceeds afterwards with nice graphs.

Tks a lot JJ. I saw your post on the original forum and gave another try and it compiled ok  :azn: :azn: :azn: I saw the graphics and it is really nice.


The command works without this "poly" token

I used as:
LolRemez "-3,3,x^3-x^2"
pause


(https://i.ibb.co/yqDbCGQ/Image1.png) (https://ibb.co/Zx3krtM)


It also exports a function to be compiled in freebasic containing the polynomials :)



I tried also with

LolRemez "0,1,exp(x)"

And it worked, but exported somethign i failed to understand. It exported this:

(((((((((((((((((((((((((((((((+0.04829789212904446)*x-0.1792981145327924)*x-0.6677612347982678)*x+5.893146189666752)*x _
-17.1355277003133)*x _
+26.0763728821472)*x _
-19.69798344538533)*x _
+0.9900030668437522)*x _
+10.30400147926733)*x _
-8.096178275550896)*x _
+11.14649562485596)*x _
-31.20839952020678)*x _
+51.6443661394669)*x _
-51.27397166179219)*x _
+31.83357261868321)*x _
-10.78621348361321)*x _
-0.4973412406197402)*x _
+2.927004122935927)*x _
-1.894224695114594)*x _
+0.741826698340469)*x _
-0.2034182656840364)*x _
+0.04056360695256523)*x _
-0.005880298181518774)*x _
+0.0008169520352547144)*x _
+0.001343698483337935)*x _
+0.008335523163337043)*x _
+0.04166660157481193)*x _
+0.1666666677131286)*x _
+0.4999999999927096)*x _
+1.000000000000015)*x _
+1) _


All of these polynomials are to be used ?
Title: Re: Fast Exp approximation
Post by: jj2007 on August 14, 2020, 09:42:46 PM
Quote from: guga on August 14, 2020, 09:22:32 PMThe command works without this "poly" token

I'm not sure about that - the output looks a lot different.
Title: Re: Fast Exp approximation
Post by: guga on August 14, 2020, 10:10:10 PM
I think that on each key you press, it is equivalent to one iteration on the original lolremez. I gave a try on the freebasic version with

LolRemez "0,1,exp(x)"

and hit 11 times the key until it produces a error of something around e-11. It produces these values:

(((((((((((+4.567028385771292e-007)*x+2.288390713140263e-006)*x+2.545465440196621e-005)*x _
+0.0001978554379130849)*x _
+0.001389191775841137)*x _
+0.008333228132164795)*x _
+0.04166668937151426)*x _
+0.1666666638148875)*x _
+0.5000000001833158)*x _
+0.9999999999954152)*x _
+1.000000000000019) _


I´ll give a try on them to see if it works, but i´m clueless how to use them yet. Better would be wait until Marinus wake up and explain to us how to use this lolremez app.

The graphics are very handy, btw, specially because we can actually see how much of the error we want to get rid off.
Title: Re: Fast Exp approximation
Post by: guga on August 14, 2020, 10:25:01 PM
Ok, i gave a try on the generated values from dodicat FreeBasic version. On 10 iterations, it produced the values i used below. But the result completely lacks accuracy

[<16 Cheby10Exp0: R$ 4.567028385771292e-007]
[<16 Cheby10Exp1: R$ 2.288390713140263e-006]
[<16 Cheby10Exp2: R$ 2.545465440196621e-005]
[<16 Cheby10Exp3: R$ 0.0001978554379130849]
[<16 Cheby10Exp4: R$ 0.001389191775841137]
[<16 Cheby10Exp5: R$ 0.008333228132164795]
[<16 Cheby10Exp6: R$ 0.04166668937151426]
[<16 Cheby10Exp7: R$ 0.1666666638148875]
[<16 Cheby10Exp8: R$ 0.5000000001833158]
[<16 Cheby10Exp9: R$ 0.9999999999954152]
[<16 Cheby10Exp10: R$ 1.000000000000019]


Proc SSE2_Exp2::
    Arguments @Number, @Flag

    mov eax D@Flag
    Test_if eax SSE_EXP_INT
        cvtsi2ss xmm1 D@Number ; converts a signed integer to double
    Test_Else_if eax SSE_EXP_FLOAT
        ;movss xmm1 D@Number ; converts a single precision float to double
        cvtss2sd xmm1 X@Number ; converts a single precision float to double
    Test_Else_if eax SSE_EXP_REAL8
        mov eax D@Number | movss XMM1 X$eax
    Test_Else_if eax SSE_EXP_QWORD
        mov eax D@Number | movq XMM1 X$eax
    Test_Else
        xor eax eax | ExitP ; return 0 Invalid parameter
    Test_End

    movsd xmm0 X$Cheby10Exp0
    mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp1
    mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp2
    mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp3
    mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp4
    mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp5
    mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp6
    mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp7
    mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp8
    mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp9
    mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp10


EndP


I gave a try calculating the exp of 5. The correct value of exp^5 should be:
148.41315910257660342111558004055227962348766759387898904675284511

But this lolremez app returned to me:
147.452655481459771

(Similar result as the original lolremez, btw). So, better wait Marinus wake up and help on this to explain why we are having so huge margin of error.
Title: Re: Fast Exp approximation
Post by: jack on August 15, 2020, 04:45:10 AM
you may also benefit from visiting the lolremez-wiki https://github.com/samhocevar/lolremez/wiki/Tutorial-1-of-5%3A-exp%28x%29-the-quick-way
the git repo is at https://github.com/samhocevar/lolremez you can build lolremez on windows using VS2019 and of course on linux
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 15, 2020, 04:53:26 AM
Hi Jack,

Guga already compiled it for us.

Hi Guga,

Exp(x) = Exp2(x / Logn(2.0))
1/Logn(2.0) = 1.4426950408889634073599246810019

Now we can use ( x * 1.4426950408889634073599246810019 ) to calculate the Exp.

E = x * 1.442695
To use the polynomials we have to use only the fractional part of E [0 to 1]
When done, scale the result by adding the integer part to the exponent field of the floating point result.

Here are the 10 degree polynomials and the SSE2 routine.


lolremez -d 10 -r 0:1 "2**x" "2**x"

// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
double f(double x)
{
    double u = 9.9522144894077596e-9;
    u = u * x + 9.4609455637620191e-8;
    u = u * x + 1.331271998884894e-6;
    u = u * x + 1.5244659656760603e-5;
    u = u * x + 1.5403957490414841e-4;
    u = u * x + 1.3333543730974749e-3;
    u = u * x + 9.6181294091749148e-3;
    u = u * x + 5.5504108628244985e-2;
    u = u * x + 2.402265069613608e-1;
    u = u * x + 6.9314718055989127e-1;
    return u * x + 1.0000000000000002;
}

.const
Chebylog2E      real4 4 dup (1.44269504089) ; 1/Logn(2.0)

Cheby10Exp0  real4 4 dup (9.9522144894077596e-9)
Cheby10Exp1  real4 4 dup (9.4609455637620191e-8)
Cheby10Exp2  real4 4 dup (1.331271998884894e-6)
Cheby10Exp3  real4 4 dup (1.5244659656760603e-5)
Cheby10Exp4  real4 4 dup (1.5403957490414841e-4)
Cheby10Exp5  real4 4 dup (1.3333543730974749e-3)
Cheby10Exp6  real4 4 dup (9.6181294091749148e-3)
Cheby10Exp7  real4 4 dup (5.5504108628244985e-2)
Cheby10Exp8  real4 4 dup (2.402265069613608e-1)
Cheby10Exp9  real4 4 dup (6.9314718055989127e-1)
Cheby10Exp10  real4 4 dup (1.0000000000000002)


.code
align 16
SSE2_Exp_10:  ; in: xmm0, out: xmm1
    movaps      xmm2,oword ptr Chebylog2E
    mulps       xmm2,xmm0
    psrld       xmm0,31
    cvttps2dq   xmm1,xmm2
    psubd       xmm1,xmm0
    movdqa      xmm0,xmm1
    cvtdq2ps    xmm1,xmm1
    subps       xmm2,xmm1
    movaps      xmm1,oword ptr Cheby10Exp0
    pslld       xmm0,23
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp1
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp2
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp3
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp4
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp5
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp6
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp7
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp8
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp9
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp10
    paddd       xmm0,xmm1
    ret
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 15, 2020, 05:37:57 AM
Here is a real8 version,  where I explain the above exp math in code.

align 16
SSE41_Exp_10: ; in: xmm0, out: xmm0
    mulpd       xmm0,oword ptr NaturalLogE
    roundpd     xmm1,xmm0,1 ; lose the fractional parts and save the integer parts in xmm1
    subpd       xmm0,xmm1               ; Subtract them, now it's inside the range, results are values between 0.0 to 1.0
    cvtpd2dq    xmm2,xmm1 ; save the integer parts in xmm2
    psllq       xmm2,52 ; shift quadwords 52 bits left ( size of real8 mantissa ) to the integer parts of the exponent parts
    movapd      xmm1,xmm0 ; copy the range
    mulpd       xmm0,oword ptr Exp_Cheby10_0_pd
    addpd       xmm0,oword ptr Exp_Cheby10_1_pd
    mulpd       xmm0,xmm1
    addpd       xmm0,oword ptr Exp_Cheby10_2_pd
    mulpd       xmm0,xmm1
    addpd       xmm0,oword ptr Exp_Cheby10_3_pd
    mulpd       xmm0,xmm1
    addpd       xmm0,oword ptr Exp_Cheby10_4_pd
    mulpd       xmm0,xmm1
    addpd       xmm0,oword ptr Exp_Cheby10_5_pd
    mulpd       xmm0,xmm1
    addpd       xmm0,oword ptr Exp_Cheby10_6_pd
    mulpd       xmm0,xmm1
    addpd       xmm0,oword ptr Exp_Cheby10_7_pd
    mulpd       xmm0,xmm1
    addpd       xmm0,oword ptr Exp_Cheby10_8_pd
    mulpd       xmm0,xmm1
    addpd       xmm0,oword ptr Exp_Cheby10_9_pd
    mulpd       xmm0,xmm1
    addpd       xmm0,oword ptr Exp_Cheby10_10_pd
    paddq       xmm0,xmm2       ; add the exponent parts to the floating point result
    ret
Title: Re: Fast Exp approximation
Post by: guga on August 15, 2020, 05:42:08 AM
Quote from: Siekmanski on August 15, 2020, 04:53:26 AM
Hi Jack,

Guga already compiled it for us.

Hi Guga,

Exp(x) = Exp2(x / Logn(2.0))
1/Logn(2.0) = 1.4426950408889634073599246810019

Now we can use ( x * 1.4426950408889634073599246810019 ) to calculate the Exp.

E = x * 1.442695
To use the polynomials we have to use only the fractional part of E [0 to 1]
When done, scale the result by adding the integer part to the exponent field of the floating point result.

Here are the 10 degree polynomials and the SSE2 routine.


lolremez -d 10 -r 0:1 "2**x" "2**x"

// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
double f(double x)
{
    double u = 9.9522144894077596e-9;
    u = u * x + 9.4609455637620191e-8;
    u = u * x + 1.331271998884894e-6;
    u = u * x + 1.5244659656760603e-5;
    u = u * x + 1.5403957490414841e-4;
    u = u * x + 1.3333543730974749e-3;
    u = u * x + 9.6181294091749148e-3;
    u = u * x + 5.5504108628244985e-2;
    u = u * x + 2.402265069613608e-1;
    u = u * x + 6.9314718055989127e-1;
    return u * x + 1.0000000000000002;
}

.const
Chebylog2E      real4 4 dup (1.44269504089) ; 1/Logn(2.0)

Cheby10Exp0  real4 4 dup (9.9522144894077596e-9)
Cheby10Exp1  real4 4 dup (9.4609455637620191e-8)
Cheby10Exp2  real4 4 dup (1.331271998884894e-6)
Cheby10Exp3  real4 4 dup (1.5244659656760603e-5)
Cheby10Exp4  real4 4 dup (1.5403957490414841e-4)
Cheby10Exp5  real4 4 dup (1.3333543730974749e-3)
Cheby10Exp6  real4 4 dup (9.6181294091749148e-3)
Cheby10Exp7  real4 4 dup (5.5504108628244985e-2)
Cheby10Exp8  real4 4 dup (2.402265069613608e-1)
Cheby10Exp9  real4 4 dup (6.9314718055989127e-1)
Cheby10Exp10  real4 4 dup (1.0000000000000002)


.code
align 16
SSE2_Exp_10:  ; in: xmm0, out: xmm1
    movaps      xmm2,oword ptr Chebylog2E
    mulps       xmm2,xmm0
    psrld       xmm0,31
    cvttps2dq   xmm1,xmm2
    psubd       xmm1,xmm0
    movdqa      xmm0,xmm1
    cvtdq2ps    xmm1,xmm1
    subps       xmm2,xmm1
    movaps      xmm1,oword ptr Cheby10Exp0
    pslld       xmm0,23
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp1
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp2
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp3
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp4
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp5
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp6
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp7
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp8
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp9
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp10
    paddd       xmm0,xmm1
    ret


Now i see where you got those values :) Thanks a lot.

One question....Why using the range only to 0 to 1 ? On github, the example is from -1 to 1 . When should we use them ?

Does 0 to 1 means only positive exponential ? Ex: exp^5
and
-1 to 1 means negative ? Ex: exp^-5 ??


Btw...did you checked the margin of error ?


I´ll give a try on the method you use and the one that uses a double check for error, like this:

lolremez -d 10 -r -1:1 "exp(x)"  "exp(x)"

What´s the difference from yours ? I mean why using "2**x" "2**x" ?

When i use "exp(x)" the values are different


Other question, why using...

    movaps      xmm2,oword ptr Chebylog2E
    mulps       xmm2,xmm0
    psrld       xmm0,31 ; Why ?
    cvttps2dq   xmm1,xmm2
    psubd       xmm1,xmm0
    movdqa      xmm0,xmm1
    cvtdq2ps    xmm1,xmm1
    subps       xmm2,xmm1
    movaps      xmm1,oword ptr Cheby10Exp0
    pslld       xmm0,23
(...)


wouldn´t only be necessary multiply this new value ?

Why not using only ?

    mulsd xmm0 X$Chebylog2E and later add and multiply by the remaining values ?
Title: Re: Fast Exp approximation
Post by: guga on August 15, 2020, 06:50:29 AM
tested for accuracy :)

It looses accuracy after the 4th digit after the "."

correct value:
exp^5 =  148.41315910257660342111558004055227962348766759387898904675284511

using lolremez
exp^5 = 148.413162231445312

This is similar to what i found yesterday. On the version i made it has 0 loss and the speed was not so different from the ones using lolremez.  Wouldn´t be better if we try to optimize the version i ported ? I´m pretty sure there´s room for optimization yet.

It´s juust a matter of trying to understand what are the values found on the Exptable.

For what i saw, those values (The bigger ones) seems to be exponents of  integers divided by 5000. While the others (around e-310 seems to be some shifted values of some sort.)

Ex: 2.04582073601169570e-16 = exp(-180628/5000)
2.42294657314453310e-310 = exp(-712.9163944364413) = exp(some shift or division by what ????)
4.99974487227263260e-17 = exp(-187673/5000)

And the "SSE_Shifter4" seems to be formed by 2 exponentials of log(2). Like this:

; (x) = exp(51 log(2) + log(3)) = 6.755399441055744e+15
; (y) = exp(-60 log(2)) = 8.67361737988403547e-19
[SSE_Shifter4: R$ 6.755399441055744e+15,    R$ 6.755399441055744e+15,
                        R$ 8.67361737988403547e-19,  R$ 8.67361737988403547e-19]
Title: Re: Fast Exp approximation
Post by: Siekmanski on August 15, 2020, 07:08:16 AM
Remember that real4 has only 7.2 digits precision so, you need real8 to get more digits ( I believe 15 for real8 )

QuoteOne question....Why using the range only to 0 to 1 ? On github, the example is from -1 to 1 . When should we use them ?

-> For example if you only need to do calculations in the range -1 to 1 as in e.g. Audio routines.
   It's saves you the correction code, splitting it up in fractions and integers and glue everything together into a float.

QuoteDoes 0 to 1 means only positive exponential ? Ex: exp^5
and
-1 to 1 means negative ? Ex: exp^-5 ??

-> No, the signs are taken care of.

QuoteBtw...did you checked the margin of error ?

-> No

QuoteI´ll give a try on the method you use and the one that uses a double check for error, like this:

lolremez -d 10 -r -1:1 "exp(x)"  "exp(x)"

What´s the difference from yours ? I mean why using "2**x" "2**x" ?

When i use "exp(x)" the values are different

-> I use the Exp2 polynomials to calculate Exp()
   2**x = 2 raised to the power of x = Exp2()
   Why? because, Exp() calculates the value of e (the base of natural logarithms) raised to the power of x.

-> By simplifying the math, we have 1 set of polynomials to calculate 2 functions, Exp() and Exp2()  :cool:
   Exp(x) = Exp2(x / Logn(2.0))
   Create a reciprocal: 1/Logn(2.0) = 1.442695
   Now we can use ( x * 1.442695 ) to calculate Exp().

QuoteOther question, why using...
    movaps      xmm2,oword ptr Chebylog2E
    mulps       xmm2,xmm0
    psrld       xmm0,31 ; Why ?
    cvttps2dq   xmm1,xmm2
    psubd       xmm1,xmm0
    movdqa      xmm0,xmm1
    cvtdq2ps    xmm1,xmm1
    subps       xmm2,xmm1
    movaps      xmm1,oword ptr Cheby10Exp0
    pslld       xmm0,23


-> In SSE2 we don't have a floor instruction ( SSE4_1 does: roundps xmm1,xmm0,1 )
   We need that to work outside the range of the polynomials.
   So we have to create it ourselfs:

    movaps      xmm2,oword ptr Chebylog2E
    mulps       xmm2,xmm0
    psrld       xmm0,31 ; put sign-bits in position
    cvttps2dq   xmm1,xmm2 ; convert with truncation, packed real4 to packed signed 32bit integers
    psubd       xmm1,xmm0 ; signed integers - sign bits
    movdqa      xmm0,xmm1 ; save them
    cvtdq2ps    xmm1,xmm1 ; convert 32bit packed integers to packed real4
    subps       xmm2,xmm1 ; get the fractional parts ( range 0.0 to 1.0 )
    movaps      xmm1,oword ptr Cheby10Exp0
    pslld       xmm0,23 ; shift 32bit integers 23 bits to the left, ( size of real4 mantissa ) into the exponent fields

at the end of the routine:
    paddd       xmm0,xmm1 ; add the exponent fields to the real4 results



Quotewouldn´t only be necessary multiply this new value ?

Why not using only ?

    mulsd xmm0 X$Chebylog2E and later add and multiply by the remaining values ?
Logged

-> Because it doesn't work that way, we need a range to work with the polynomials.
Title: Re: Fast Exp approximation
Post by: InfiniteLoop on August 15, 2020, 11:20:48 PM
edit:
Improved version:
.data
E_1 REAL4 1.0000
E_2 REAL4 0.5000
E_3 REAL4 0.16666
E_4 REAL4 0.04166
E_5 REAL4 0.00833
.code
ASM_Exp_2 proc
movss xmm4, xmm0
movss xmm1, xmm0
addss xmm1, E_1
mulss xmm4, xmm0
movss xmm2, E_2
vfmadd231ss xmm1, xmm2, xmm4
mulss xmm4, xmm0
movss xmm3, E_3
vfmadd231ss xmm1, xmm3, xmm4
mulss xmm4, xmm0
movss xmm2, E_4
vfmadd231ss xmm1, xmm2, xmm4
mulss xmm4, xmm0
movss xmm3, E_5
vfmadd231ss xmm1, xmm3, xmm4
movss xmm0, xmm1
ret
ASM_Exp_2 endp

end

Here's my rough & ready effort.


.data
N_1 REAL4 1.000000000000000
N_2 REAL4 2.000000000000000
N_3 REAL4 6.000000000000000
N_4 REAL4 24.000000000000000
N_5 REAL4 120.000000000000000
.code
ASM_Exp proc
movss xmm1, N_1
movss xmm2, xmm0
movss xmm3, xmm1
movss xmm5, xmm1
addss xmm1, xmm0    ;x+1
movss xmm4, N_2
divss xmm3, xmm4    ;1/n!
mulss xmm2, xmm0    ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_3
divss xmm3, xmm4    ;1/n!
mulss xmm2, xmm0    ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_4
divss xmm3, xmm4    ;1/n!
mulss xmm2, xmm0    ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_5
divss xmm3, xmm4    ;1/n!
mulss xmm2, xmm0    ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm0, xmm1
ret
ASM_Exp endp

end

Title: Re: Fast Exp approximation
Post by: Siekmanski on August 15, 2020, 11:46:18 PM
Quote from: InfiniteLoop on August 15, 2020, 11:20:48 PM
Here's my rough & ready effort.


.data
N_1 REAL4 1.000000000000000
N_2 REAL4 2.000000000000000
N_3 REAL4 6.000000000000000
N_4 REAL4 24.000000000000000
N_5 REAL4 120.000000000000000
.code
ASM_Exp proc
movss xmm1, N_1
movss xmm2, xmm0
movss xmm3, xmm1
movss xmm5, xmm1
addss xmm1, xmm0    ;x+1
movss xmm4, N_2
divss xmm3, xmm4    ;1/n!
mulss xmm2, xmm0    ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_3
divss xmm3, xmm4    ;1/n!
mulss xmm2, xmm0    ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_4
divss xmm3, xmm4    ;1/n!
mulss xmm2, xmm0    ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_5
divss xmm3, xmm4    ;1/n!
mulss xmm2, xmm0    ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm0, xmm1
ret
ASM_Exp endp

end


My computer is to old, can't handle FMA instructions.  :sad: