Fast Exp approximation

guga · August 12, 2020, 12:29:23 AM

A fast and precise exponent function using SSE2

In my AMD it took only 31 to 32 clocks in average. I´m quite sure it still have some more room to optimize

Code Select


[SSE_Cv4: R$ 92.3324826168936567, R$ 92.3324826168936567]
; Int64ToDoubleMagic
[SSE_Shifter4: R$ 6.755399441055744e+15,    R$ 6.755399441055744e+15,
               R$ 8.67361737988403547e-19,  R$ 8.67361737988403547e-19]
[SSE_ExpVar1: R$ 1.08304246962234174e-2, R$ 1.08304246962234174e-2]
[SSE_ExpVar2: R$ 2.5728046223276691e-14, R$ 2.5728046223276691e-14]
[SSE_ExpVar3: R$ 8.33216827061673306e-3, R$ 1.66666666698150945e-1]
[SSE_ExpVar4: R$ 4.16667208687284685e-2, R$ 4.99999999999566291e-1]

[ExpTable: R$ 0,                       0,                        2.04582073601169570e-16, 2.42294657314453310e-310
           R$ 4.99974487227263260e-17, 4.87227730461210550e-310, 7.35784687124741820e-18, 7.34827949904982370e-310
           R$ 8.18931763819551360e-17, 9.85124358964738520e-310, 1.66658814423267470e-18, 1.23814631722046340e-309
           R$ 1.34046152845389320e-16, 1.49392350395699130e-309, 1.44126483728110840e-16, 1.75248592164539530e-309
           R$ 1.75676685628521850e-16, 2.01386389946222700e-309, 4.77695942525622330e-17, 2.27808809684712520e-309
           R$ 9.34171060990504440e-17, 2.54518950709912420e-309, 4.58567032666235110e-17, 2.81519946101217220e-309
           R$ 7.82657325863607620e-17, 3.08814963055018380e-309, 2.82378442595106130e-17, 3.36407203256215700e-309
           R$ 3.29047266460084160e-17, 3.64299903253770950e-309, 4.72136812117012750e-17, 3.92496334840353820e-309
           R$ 3.34846233362515240e-17, 4.20999805436120270e-309, 5.52755004850524930e-17, 4.49813658476670150e-309
           R$ 1.19250041407656740e-16, 4.78941273805230130e-309, 1.65291010027318310e-16, 5.08386068069106740e-309
           R$ 3.75085420130312720e-17, 5.38151495120456640e-309, 1.71528210782905300e-16, 5.68241046421419000e-309
           R$ 2.10230496752157180e-18, 5.98658251453665700e-309, 1.33575100888345410e-17, 6.29406678132401250e-309
           R$ 1.95725852931120360e-17, 6.60489933224881300e-309, 1.14594728088946900e-16, 6.91911662773481080e-309
           R$ 1.45979432294737540e-16, 7.23675552523374920e-309, 6.66380458923219520e-17, 7.55785328354870360e-309
           R$ 5.68648095791174000e-17, 7.88244756720450720e-309, 7.00787504690699450e-17, 8.21057645086578490e-309
           R$ 1.11524233151705370e-16, 8.54227842380307460e-309, 8.99959272503798580e-17, 8.87759239440759380e-309
           R$ 8.86511592917582850e-17, 9.21655769475515650e-309, 1.46901932708742340e-16, 9.55921408521980240e-309
           R$ 1.32721817677819360e-16, 9.90560175913766330e-309, 1.13655151388858840e-16, 1.02557613475215990e-308
           R$ 1.26761474172717220e-16, 1.06097339238271920e-308, 9.50689710108795960e-18, 1.09675610087706390e-308
           R$ 7.97786311075428990e-17, 1.13292845751990740e-308, 7.32795774215157790e-17, 1.16949470530139930e-308
           R$ 5.15483011707867830e-17, 1.20645913341482120e-308, 2.42539857666898060e-17, 1.24382607775970770e-308
           R$ 1.34460827434261430e-16, 1.28159992145044320e-308, 7.60136434780041980e-17, 1.31978509533039990e-308
           R$ 1.53414100536037230e-17, 1.35838607849166940e-308, 9.51550643423588250e-17, 1.39740739880045840e-308
           R$ 7.33935310510226070e-17, 1.43685363342820510e-308, 3.54094826264618290e-17, 1.47672940938847590e-308
           R$ 4.87516052622706170e-17, 1.51703940407971140e-308, 1.25886762958400070e-16, 1.55778834583388210e-308
           R$ 1.18426926175418470e-16, 1.59898101447111830e-308, 1.82140544036225900e-17, 1.64062224186037870e-308
           R$ 1.68548729062897310e-17, 1.68271691248622430e-308, 3.62161593533689420e-17, 1.72526996402176670e-308
           R$ 1.01562190116415010e-17, 1.76828638790785260e-308, 6.74378629321924300e-17, 1.81177122993855490e-308
           R$ 1.79012690760451310e-17, 1.85572959085304300e-308, 5.26537076855627410e-17, 1.90016662693389250e-308
           R$ 8.58071433248992810e-17, 1.94508755061191610e-308, 1.79639326598330220e-17, 1.99049763107757670e-308
           R$ 6.04870235399373270e-17, 2.03640219489905820e-308, 5.33680587851415070e-17, 2.08280662664707050e-308
           R$ 4.57849152770600950e-17, 2.12971636952645360e-308, 2.04142788975783030e-17, 2.17713692601466200e-308]

[SSE_Exp_MMask0: Q$ SSE_EXPONENT_MMASK, SSE_EXPONENT_MMASK]
[SSE_Exp_Bias0: Q$ SSE_EXPONENT_BIAS, SSE_EXPONENT_BIAS]
[SSE_OneVal0: R$ 1.0]
[SSE_Exp_EMask0: Q$ SSE_EXPONENT_EMASK, SSE_EXPONENT_EMASK]

[SSe_Exp_RetVal: Q$ 0]
[SSE_TmpVar2: Q$ 0]

; Equates used

[SSE_EXPONENT_BIAS 0FFC0]       ;  2^16-2^6 = 2^16-64 = 65472
[SSE_EXPONENT_MMASK 0FFFFFFC0]   ;  2^32-2^6 = 4294967232 (in int)
[SSE_EXPONENT_EMASK 0FFF00000]   ;  4293918720 (in int)

; Parameters flag
[SSE_EXP_INT 1
 SSE_EXP_FLOAT 2
 SSE_EXP_REAL8 4
 SSE_EXP_QWORD 8]

; Values to return
[SSE_EXP_INVALID_PARAMETER 0-1] ; Invalid flag
[SSE_UNDERFLOW 0-1] ; The inputed number is underflow
[SSE_OVERFLOWN 0-2] ; The inputed number is overflow
[SSE_INFINITE 0-3] ; General error. The inputed number is infinite, or NAN etc

;;
    SSE2_Exp_precise
    
    This function calculates the exponential of a number using approximations on a higher level of accuracy.

    Parameters:
        Number (in) - The inputed value to be found. Your an insert any type of number to be calculated. Negative, positive,
                      Float, Integer, Double, Qword etc.
                      This parameter can be used as a Int or a Pointer to the value depending of the used flag.
        
        Flag. A set of flags to be used accordying to the type of input.
                Equate Name     Value       Description
                SSE_EXP_INT       1             The user inputed a Int value to be calculated.
                SSE_EXP_FLOAT     2             The inputed value is a Float (Real4) data.
                SSE_EXP_REAL8     3             The inputed value is a Pointer to a Real8 data.
                SSE_EXP_QWORD     4             The inputed value is a Pointer to a QWord data.

    Return Value:
        If the function suceeds, it returns TRUE and the resultant value is stored as a Double (Real8) in the low quadword of  xmm0 register
        
        If the function fails it returns a set of error cases described in the equates below:
        [SSE_EXP_INVALID_PARAMETER 0] ; Invalid flag specified.
        [SSE_UNDERFLOW 0-1] ; The inputed number is underflow
        [SSE_OVERFLOWN 0-2] ; The inputed number is overflow
        [SSE_INFINITE 0-3] ; General error. The inputed number is infinite, or NAN etc


    Example of usage:

    1 - The inputed value (-5) is a Signed Integer
            call Sse2_exp_precise 0-5, SSE_EXP_INT

    2 - The inputed value (-5) is a Float (Real4)
        [MyNumber: F$ -5.0]
            call Sse2_exp_precise D$MyNumber, SSE_EXP_FLOAT

    3 - The inputed value (5) is a Real8
        [MyNumber: R$ 5.0]
            call Sse2_exp_precise MyNumber, SSE_EXP_REAL8

    4 - The inputed value (5) is a Qword (int64)
        [MyNumber: Q$ 5.0]
            call Sse2_exp_precise MyNumber, SSE_EXP_QWORD
;;

Proc Sse2_exp_precise:
    Arguments @Number, @Flag
    Local @ControlWord
    Uses edx, ecx

    mov D@ControlWord 0

    mov eax D@Flag
    Test_if eax SSE_EXP_INT
        cvtsi2sd xmm0 D@Number ; converts a signed integer to double
    Test_Else_if eax SSE_EXP_FLOAT
        cvtss2sd xmm0 X@Number ; converts a single precision float to double
    Test_Else_if eax SSE_EXP_REAL8
        mov eax D@Number | movsd XMM0 X$eax
    Test_Else_if eax SSE_EXP_QWORD
        mov eax D@Number | movq XMM0 X$eax
    Test_Else
        xor eax eax | ExitP ; return 0 Invalid parameter
    Test_End

    unpcklpd XMM0 XMM0 | pextrw eax XMM0 3 | and eax 07FFF
    mov edx 0408F | sub edx eax
    sub eax 03C90 | or edx eax
    ...If edx < 080000000 ; Number is not zero
        movupd XMM1 X$SSE_Cv4
        movupd XMM6 X$SSE_Shifter4
        movupd XMM2 X$SSE_ExpVar1
        movupd XMM3 X$SSE_ExpVar2

        mulpd XMM1 XMM0 | addpd XMM1 XMM6 | movupd XMM7 XMM1
        subpd XMM1 XMM6 | mulpd XMM2 XMM1
        mulpd XMM3 XMM1
        movupd XMM5 X$SSE_ExpVar4
        subpd XMM0 XMM2
        movd eax XMM7 | mov ecx eax | and ecx 03F | shl ecx 04 | sar eax 06 | mov edx eax
        subpd XMM0 XMM3
        movupd XMM2 X$ecx+ExpTable
        movupd XMM4 X$SSE_ExpVar3
        mulpd XMM4 XMM0
        movupd XMM1 XMM0
        mulpd XMM0 XMM0
        addpd XMM5 XMM4
        mulsd XMM0 XMM0
        addsd XMM1 XMM2
        unpckhpd XMM2 XMM2
        movdqu XMM6 X$SSE_Exp_MMask0 | pand XMM7 XMM6
        movdqu XMM6 X$SSE_Exp_Bias0 | paddq XMM7 XMM6 | psllq XMM7 46
        mulpd XMM0 XMM5
        addsd XMM1 XMM0
        orpd XMM2 XMM7
        unpckhpd XMM0 XMM0
        addsd XMM0 XMM1
        add edx 037E
        ..If edx <= 077C ; Adjust 0
            mulsd XMM0 XMM2
            addsd XMM0 XMM2
            mov eax &TRUE
        ..Else
            fstcw W@ControlWord
            mov dx W@ControlWord | or dx 0300 | mov W$SSe_Exp_RetVal dx | fldcw W$SSe_Exp_RetVal
            mov edx eax | sar eax 1 | sub edx eax
            movdqu XMM6 X$SSE_Exp_EMask0
            pandn XMM6 XMM2
            add eax 03FF | movd XMM3 ax | psllq XMM3 52
            orpd XMM6 XMM3
            add edx 03FF | movd XMM4 dx | psllq XMM4 52
            movlpd X$SSe_Exp_RetVal XMM0 | fld R$SSe_Exp_RetVal
            movlpd X$SSE_TmpVar2 XMM0 | fld R$SSE_TmpVar2
            fmul ST1 ST0
            faddp ST1 ST0
            movlpd X$SSe_Exp_RetVal XMM4 | fld R$SSe_Exp_RetVal
            fmulp ST1 ST0
            fstp R$SSe_Exp_RetVal
            movlps XMM0 X$SSe_Exp_RetVal
            fldcw W@ControlWord
            pextrw ecx XMM0 3 | and ecx 07FF0
            If ecx >= 07FF0 ; Overflown
                ;xor eax eax
                mov eax SSE_OVERFLOWN
            Else_If ecx = 0 ; Found underflow
                ;xor eax eax
                mov eax SSE_UNDERFLOW
            Else
                mov eax &TRUE
            End_If
        ..End_If
    ...Else
        ; Number seems to be zero. Let´s confirm.
        mov eax D@Number | and eax 07FFFFFFF
        If eax < 040900000 ; Input is zero. Return 1
            addsd XMM0 X$SSE_OneVal0
            mov eax &TRUE
        Else
            ; Error. The number can be a underflow or overflow, NAN, INF etc
            mov eax 0-1 ; Underflow_Overflow
        End_If
    ...End_If

EndP

If someone can please test to benchmark it would be appreciated :) Attached the masm version ;)

Btw...below is another version (a inaccurate one

) that takes only 14 clocks

Code Select


[M_LN2_D: R$ 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184,
          R$ 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184] ; 0.5/(ln(2))
[SSE_FastExp_Var1: R$ 3070.95411, R$ 3070.95411]; 3 * 1024.0 - 1.05

(...)
The inputed value must be in xmm0 and..thats it. Only 10 instructions :bgrin: :bgrin: :bgrin:

    movups xmm2 X$M_LN2_D
    movups xmm1 X$SSE_FastExp_Var1 | mulpd xmm0 xmm2
    movups xmm2 xmm1 | addpd xmm2 xmm0
    psllq XMM2 11

    ; now we subtract
    subpd xmm1 xmm0
    psllq XMM1 11

    divpd xmm2 xmm1
    movups xmm0 xmm2

This code is adapted from
https://stackoverflow.com/questions/53872265/how-do-you-process-exp-with-sse2

basically, it does only this:

shl 11 ((Number*(0.5/(ln(2))+(3 * 1024.0 - 1.05)))
______________________________________________
shl 11 ((Number*(0.5/(ln(2))-(3 * 1024.0 - 1.05)))

Updated. Attached JJ benchmark test "GugatestExponent.zip"

hutch-- · August 12, 2020, 12:40:20 AM

Guga,

You need to build it so anyone can test it. I downloaded in but its a source of unknown assembler type.

guga · August 12, 2020, 12:52:48 AM

Unknown ?

That´s weird. I thought it was compatible to masm. I´ll ty to port it.

hutch-- · August 12, 2020, 01:12:49 AM

Just build it so we can test it for you.

guga · August 12, 2020, 02:04:44 AM

Ok. i guess now it will work.

jj2007 · August 12, 2020, 02:10:21 AM

I just see a box saying 3D frames

guga · August 12, 2020, 02:22:49 AM

try now.

It is only the function code i made. Not a working app. The returned value is stored in the 1st double in xmm0

Code Select


; #########################################################################
;
; The framing affects are drawn on the client area by a single procedure,
; "Frame3D". In the WmdProc message handling Proc, the WM_PAINT message
; calls another Proc called "Paint_Proc" which contains the procedure calls
; to "Frame3D".
;
; #########################################################################

.686p
.model flat,stdcall
.mmx
.xmm

OPTION CASEMAP:NONE



; #########################################################################

      include \masm32\include\windows.inc
      include \masm32\include\user32.inc
      include \masm32\include\kernel32.inc
      include \masm32\include\gdi32.inc
      
      includelib \masm32\lib\user32.lib
      includelib \masm32\lib\kernel32.lib
      includelib \masm32\lib\gdi32.lib

; #########################################################################

      ;=============
      ; Local macros
      ;=============

      szText MACRO Name, Text:VARARG
        LOCAL lbl
          jmp lbl
            Name db Text,0
          lbl:
        ENDM

      m2m MACRO M1, M2
        push M2
        pop  M1
      ENDM

      return MACRO arg
        mov eax, arg
        ret
      ENDM

        ;=================
        ; Local prototypes
        ;=================
;        Sse2_exp_precise PROTO :DWORD,:DWORD

    .data?
    XMMWORD2 struc
     data   dq 2 dup (?)
    XMMWORD2 ends
    


    .data

        szDisplayName db "3D Frames",0
        CommandLine   dd 0
        hWnd          dd 0
        hInstance     dd 0

SSE_Cv4 dq 92.3324826168936567, 92.3324826168936567
; Int64ToDoubleMagic
SSE_Shifter4 dq 6.755399441055744e+15,    6.755399441055744e+15,
                8.67361737988403547e-19,    8.67361737988403547e-19
SSE_ExpVar1 dq 1.08304246962234174e-2, 1.08304246962234174e-2
SSE_ExpVar2 dq 2.5728046223276691e-14, 2.5728046223276691e-14
SSE_ExpVar3 dq 8.33216827061673306e-3, 1.66666666698150945e-1
SSE_ExpVar4 dq 4.16667208687284685e-2, 4.99999999999566291e-1

ExpTable dq 0,                       0,                        2.04582073601169570e-16, 2.42294657314453310e-310
               dq 4.99974487227263260e-17, 4.87227730461210550e-310, 7.35784687124741820e-18, 7.34827949904982370e-310
               dq 8.18931763819551360e-17, 9.85124358964738520e-310, 1.66658814423267470e-18, 1.23814631722046340e-309
               dq 1.34046152845389320e-16, 1.49392350395699130e-309, 1.44126483728110840e-16, 1.75248592164539530e-309
               dq 1.75676685628521850e-16, 2.01386389946222700e-309, 4.77695942525622330e-17, 2.27808809684712520e-309
               dq 9.34171060990504440e-17, 2.54518950709912420e-309, 4.58567032666235110e-17, 2.81519946101217220e-309
               dq 7.82657325863607620e-17, 3.08814963055018380e-309, 2.82378442595106130e-17, 3.36407203256215700e-309
               dq 3.29047266460084160e-17, 3.64299903253770950e-309, 4.72136812117012750e-17, 3.92496334840353820e-309
               dq 3.34846233362515240e-17, 4.20999805436120270e-309, 5.52755004850524930e-17, 4.49813658476670150e-309
               dq 1.19250041407656740e-16, 4.78941273805230130e-309, 1.65291010027318310e-16, 5.08386068069106740e-309
               dq 3.75085420130312720e-17, 5.38151495120456640e-309, 1.71528210782905300e-16, 5.68241046421419000e-309
               dq 2.10230496752157180e-18, 5.98658251453665700e-309, 1.33575100888345410e-17, 6.29406678132401250e-309
               dq 1.95725852931120360e-17, 6.60489933224881300e-309, 1.14594728088946900e-16, 6.91911662773481080e-309
               dq 1.45979432294737540e-16, 7.23675552523374920e-309, 6.66380458923219520e-17, 7.55785328354870360e-309
               dq 5.68648095791174000e-17, 7.88244756720450720e-309, 7.00787504690699450e-17, 8.21057645086578490e-309
               dq 1.11524233151705370e-16, 8.54227842380307460e-309, 8.99959272503798580e-17, 8.87759239440759380e-309
               dq 8.86511592917582850e-17, 9.21655769475515650e-309, 1.46901932708742340e-16, 9.55921408521980240e-309
               dq 1.32721817677819360e-16, 9.90560175913766330e-309, 1.13655151388858840e-16, 1.02557613475215990e-308
               dq 1.26761474172717220e-16, 1.06097339238271920e-308, 9.50689710108795960e-18, 1.09675610087706390e-308
               dq 7.97786311075428990e-17, 1.13292845751990740e-308, 7.32795774215157790e-17, 1.16949470530139930e-308
               dq 5.15483011707867830e-17, 1.20645913341482120e-308, 2.42539857666898060e-17, 1.24382607775970770e-308
               dq 1.34460827434261430e-16, 1.28159992145044320e-308, 7.60136434780041980e-17, 1.31978509533039990e-308
               dq 1.53414100536037230e-17, 1.35838607849166940e-308, 9.51550643423588250e-17, 1.39740739880045840e-308
               dq 7.33935310510226070e-17, 1.43685363342820510e-308, 3.54094826264618290e-17, 1.47672940938847590e-308
               dq 4.87516052622706170e-17, 1.51703940407971140e-308, 1.25886762958400070e-16, 1.55778834583388210e-308
               dq 1.18426926175418470e-16, 1.59898101447111830e-308, 1.82140544036225900e-17, 1.64062224186037870e-308
               dq 1.68548729062897310e-17, 1.68271691248622430e-308, 3.62161593533689420e-17, 1.72526996402176670e-308
               dq 1.01562190116415010e-17, 1.76828638790785260e-308, 6.74378629321924300e-17, 1.81177122993855490e-308
               dq 1.79012690760451310e-17, 1.85572959085304300e-308, 5.26537076855627410e-17, 1.90016662693389250e-308
               dq 8.58071433248992810e-17, 1.94508755061191610e-308, 1.79639326598330220e-17, 1.99049763107757670e-308
               dq 6.04870235399373270e-17, 2.03640219489905820e-308, 5.33680587851415070e-17, 2.08280662664707050e-308
               dq 4.57849152770600950e-17, 2.12971636952645360e-308, 2.04142788975783030e-17, 2.17713692601466200e-308



SSE_Exp_MMask0 dd 0FFFFFFC0h, 0, 0FFFFFFC0h, 0
SSE_Exp_Bias0 dq 0FFC0h, 0FFC0h
SSE_OneVal0 dq 1.0
SSE_Exp_EMask0 dq 0FFF00000h, 0FFF00000h

SSe_Exp_RetVal dq 0
SSE_TmpVar2 dq 0

    .code

start:




    push 1
    push 5
    call Sse2_exp_precise
    invoke ExitProcess, 0

; #########################################################################

Sse2_exp_precise proc

ControlWord     = dword ptr -4
Number          = dword ptr  8
Flag            = dword ptr  0Ch

                push    ebp
                mov     ebp, esp
                sub     esp, 4
                push    edx
                push    ecx
                mov     [ebp+ControlWord], 0
                mov     eax, [ebp+Flag]
                test    eax, 1
                jz      short loc_4CE8D0
                cvtsi2sd xmm0, [ebp+Number]
                jmp     short loc_4CE905
; ---------------------------------------------------------------------------

loc_4CE8D0:                             ; CODE XREF: Sse2_exp_precise+17?j
                test    eax, 2
                jz      short loc_4CE8DE
                cvtss2sd xmm0, [ebp+Number]
                jmp     short loc_4CE905
; ---------------------------------------------------------------------------

loc_4CE8DE:                             ; CODE XREF: Sse2_exp_precise+25?j
                test    eax, 4
                jz      short loc_4CE8EE
                mov     eax, [ebp+Number]
                movsd   xmm0, qword ptr [eax]
                jmp     short loc_4CE905
; ---------------------------------------------------------------------------

loc_4CE8EE:                             ; CODE XREF: Sse2_exp_precise+33?j
                test    eax, 8
                jz      short loc_4CE8FE
                mov     eax, [ebp+Number]
                movq    xmm0, qword ptr [eax]
                jmp     short loc_4CE905
; ---------------------------------------------------------------------------

loc_4CE8FE:                             ; CODE XREF: Sse2_exp_precise+43?j
                xor     eax, eax
                jmp     loc_4CEAE7
; ---------------------------------------------------------------------------

loc_4CE905:                             ; CODE XREF: Sse2_exp_precise+1E?j
                                        ; Sse2_exp_precise+2C?j ...
                unpcklpd xmm0, xmm0
                pextrw  eax, xmm0, 3
                and     eax, 7FFFh
                mov     edx, 408Fh
                sub     edx, eax
                sub     eax, 3C90h
                or      edx, eax
                cmp     edx, 80000000h
                jnb     loc_4CEAC4
                movupd  xmm1, SSE_Cv4
                movupd  xmm6, SSE_Shifter4
                movupd  xmm2, SSE_ExpVar1
                movupd  xmm3, SSE_ExpVar2
                mulpd   xmm1, xmm0
                addpd   xmm1, xmm6
                movupd  xmm7, xmm1
                subpd   xmm1, xmm6
                mulpd   xmm2, xmm1
                mulpd   xmm3, xmm1
                movupd  xmm5, SSE_ExpVar4
                subpd   xmm0, xmm2
                movd    eax, xmm7
                mov     ecx, eax
                and     ecx, 3Fh
                shl     ecx, 4
                sar     eax, 6
                mov     edx, eax
                subpd   xmm0, xmm3
                movupd  xmm2, ExpTable[ecx]
                movupd  xmm4, SSE_ExpVar3
                mulpd   xmm4, xmm0
                movupd  xmm1, xmm0
                mulpd   xmm0, xmm0
                addpd   xmm5, xmm4
                mulsd   xmm0, xmm0
                addsd   xmm1, xmm2
                unpckhpd xmm2, xmm2
                movdqu  xmm6, xmmword ptr SSE_Exp_MMask0
                pand    xmm7, xmm6
                movdqu  xmm6, xmmword ptr SSE_Exp_Bias0
                paddq   xmm7, xmm6
                psllq   xmm7, 2Eh
                mulpd   xmm0, xmm5
                addsd   xmm1, xmm0
                orpd    xmm2, xmm7
                unpckhpd xmm0, xmm0
                addsd   xmm0, xmm1
                add     edx, 37Eh
                cmp     edx, 77Ch
                ja      loc_4CEA07
                mulsd   xmm0, xmm2
                addsd   xmm0, xmm2
                mov     eax, 1
                jmp     loc_4CEABF
; ---------------------------------------------------------------------------

loc_4CEA07:                             ; CODE XREF: Sse2_exp_precise+13F?j
                fstcw   word ptr [ebp+ControlWord]
                mov     dx, word ptr [ebp+ControlWord]
                or      dx, 300h
                mov     word ptr SSe_Exp_RetVal, dx
                fldcw   word ptr SSe_Exp_RetVal
                mov     edx, eax
                sar     eax, 1
                sub     edx, eax
                movdqu  xmm6, xmmword ptr SSE_Exp_EMask0
                pandn   xmm6, xmm2
                add     eax, 3FFh
                movd    xmm3, eax
                psllq   xmm3, 34h
                orpd    xmm6, xmm3
                add     edx, 3FFh
                movd    xmm4, edx
                psllq   xmm4, 34h
                movlpd  SSe_Exp_RetVal, xmm0
                fld     SSe_Exp_RetVal
                movlpd  SSE_TmpVar2, xmm0
                fld     SSE_TmpVar2
                fmul    st(1), st
                faddp   st(1), st
                movlpd  SSe_Exp_RetVal, xmm4
                fld     SSe_Exp_RetVal
                fmulp   st(1), st
                fstp    SSe_Exp_RetVal
                movlps  xmm0, SSe_Exp_RetVal
                fldcw   word ptr [ebp+ControlWord]
                pextrw  ecx, xmm0, 3
                and     ecx, 7FF0h
                cmp     ecx, 7FF0h
                jb      short loc_4CEAAE
                mov     eax, 0FFFFFFFEh
                jmp     short loc_4CEABF
; ---------------------------------------------------------------------------

loc_4CEAAE:                             ; CODE XREF: Sse2_exp_precise+1F5?j
                cmp     ecx, 0
                jnz     short loc_4CEABA
                mov     eax, 0FFFFFFFFh
                jmp     short loc_4CEABF
; ---------------------------------------------------------------------------

loc_4CEABA:                             ; CODE XREF: Sse2_exp_precise+201?j
                mov     eax, 1

loc_4CEABF:                             ; CODE XREF: Sse2_exp_precise+152?j
                                        ; Sse2_exp_precise+1FC?j ...
                jmp     loc_4CEAE7
; ---------------------------------------------------------------------------

loc_4CEAC4:                             ; CODE XREF: Sse2_exp_precise+77?j
                mov     eax, [ebp+Number]
                and     eax, 7FFFFFFFh
                cmp     eax, 40900000h
                jnb     short loc_4CEAE2
                addsd   xmm0, SSE_OneVal0
                mov     eax, 1
                jmp     short loc_4CEAE7
; ---------------------------------------------------------------------------

loc_4CEAE2:                             ; CODE XREF: Sse2_exp_precise+221?j
                mov     eax, 0FFFFFFFFh

loc_4CEAE7:                             ; CODE XREF: Sse2_exp_precise+50?j
                                        ; Sse2_exp_precise:loc_4CEABF?j ...
                pop     ecx
                pop     edx
                mov     esp, ebp
                pop     ebp
                retn    8
Sse2_exp_precise endp

; #########################################################################

end start

jj2007 · August 12, 2020, 04:54:02 AM

It looks like you used the asm output of a compiler. What does...

Code Select

    push 1
    push 5
    call Sse2_exp_precise

... perform? What are the parameters? What is the expected output? I get 148.4131591 in the low double of xmm0, does that make sense, and how?

What exactly is an "exp function"? I consider this an "exp function":

Code Select

include \masm32\MasmBasic\MasmBasic.inc
MyDouble double ?
Init
  ExpXY(4, 0.5, MyDouble)
  Inkey Str$("4^0.5=%9f", MyDouble)	; Output: 4^0.5=2.00000000
EndOfCode

P.S., to make your source assemble with UAsm:

Code Select

movupd  xmm6, OWORD ptr SSE_Shifter4
movupd  xmm2, OWORD ptr ExpTable[ecx]

guga · August 12, 2020, 06:33:55 AM

Hi JJ

Yes, when you input 5 as an integer value, the result is 148.4131591....

The parameters are:

1st Parameter = The value to be calculated. It can be a integer, a Float or a address to Real8 (or qword)
2nd Parameter - The flag to be used on each type of input. I build a set of equates for that.
SSE_EXP_INT equ 1
SSE_EXP_FLOAT equ 2
SSE_EXP_REAL8 equ 4
SSE_EXP_QWORD equ 8

So, if the input is a integer you must use
invoke Sse2_exp_precise, 5, 1
or
invoke Sse2_exp_precise, 5, SSE_EXP_INT

If the input if a Float (Real4) you use:

MyValue Real4 5.0 ; I believe this is the masm syntax for Float, right ?

invoke Sse2_exp_precise, [MyValue], 2

or
invoke Sse2_exp_precise, [MyValue], SSE_EXP_FLOAT

If the input is a Real8, you must use as a input the address (Offset) of the Value, like this:

MyValue Real8 5.0
invoke Sse2_exp_precise, offset MyValue, 4

or

MyValue Real8 5.0
invoke Sse2_exp_precise, offset MyValue, SSE_EXP_REAL8

And, the same for a Qword (int64). I don´t remember what is the syntax for a Qword in masm. But it is for integers in 64 bits.

It works for either positive or negative values.

Exp is the exponencial function.

exp(5) = 148.4131591025766034211155800405522796234876675938789890467

on mine version it is close (If not, equal)
exp(5) = 148.4131591...

I was trying to make it work with masmbasic, but i don´t know how to setup the output for ir display the result 148.41....

I tried this:

Code Select


  push esi
  push loops-1
  NanoTimer()
	push 1
	push 5
	call	Sse2_exp_precise
	; cvtsd2ss eax, xmm0 	; convert to int
	cvtsd2ss xmm0, xmm0
	
; 	cvtsd2si  eax, xmm0
; 	fstp 	tmp8
	
	Print Str$("## SSe_2_exp_Precise result Exp(5) = %f", eax),  Str$("<>%i ##\n\n", esi)

But...got nowehere. I only got the amount of time the function took to work. (Attached file). It assembled on masmbasic and worked, but i don´t know how to set a variable in masm basic to see the result.

jj2007 · August 12, 2020, 11:50:31 AM

Here it is - your source, formatted and ready to launch, Str$(f:xmm0) did the job ;-)

Code Select

  REPEAT 3
  NanoTimer()
  push loops-1
  .Repeat
	push 1	; 
	push 5
	call Sse2_exp_precise	; invoke Sse2_exp_precise, 5, 1
	dec stack
  .Until Sign?
  pop edx
  Print Str$("%i Mio Sse2_exp_precise iterations took ", loops/1000000), NanoTimer$(), Str$(" - result Exp(5) = %9f\n", f:xmm0)
  NanoTimer()
  push loops-1
  .Repeat
	ExpE(5, tmp8)
	dec stack
  .Until Sign?
  pop edx
  Print Str$("%i Mio MasmBasic ExpE() iterations took ", loops/1000000), NanoTimer$(), Str$(" - result Exp(5) = %9f\n\n", tmp8)
  ENDM

You made me discover a bug in MasmBasic, therefore it will build correctly only with MB version 12 August 2020

Actually, it was not really a bug but...

MbExp PROTO
if 0
ffree st(7) ; MbExp[-12]
fldl2e
jmp @F
Mb10X: ffree st(7) ; MbExp[-6]
fldl2t
@@: fmul
endif
align 4
MbExp proc
sub esp, 3*DWORD
r10 equ stack
mov eax, 3fffh ; credits to bitRake, qWord & Agner Fog
xor edx, edx

What happened here is that a few years ago, when looking at this, I didn't quite understand what this code did (April 2013 - oooold), so I commented it out with the if 0, and from then on a call MbExp[-12] ended up in No Man's Land. A clear case of not enough comments

guga · August 12, 2020, 03:05:15 PM

Timmings Result

JJ, can you please see if i ported it correctly this time? I was thinking a bit weird the result be that slow and gave a try with another timming example you made using Dave routine called "f8Exp". I replaced his version with mine to test the speed of the algo

Can you please see if the porting is ok, and also try to force the result to export from xmm0 to MyReal8 to we also check the resultant value and the timmings ?

Here, the timmings are now Ok. They fits to the results i´ve got earlier.

Code Select


AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

2076    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11193   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25992   cycles for 100 * pow (CRT, 2.7182818^5)

2024    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11148   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
29457   cycles for 100 * pow (CRT, 2.7182818^5)

2039    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11639   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25994   cycles for 100 * pow (CRT, 2.7182818^5)

2065    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11204   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27472   cycles for 100 * pow (CRT, 2.7182818^5)

2033    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
12207   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25956   cycles for 100 * pow (CRT, 2.7182818^5)

148.413159102577 for Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)

--- ok ---

If the test is ok, you guys can use the function if needed, and also change it to it be more readable using the correct masm macros etc. I finished a similar routine for calculating a fast Logarithm and fast pow of a number.I should post it soon if the tests of this sse2 exp precise are ok.

jj2007 · August 12, 2020, 06:09:18 PM

Quote from: guga on August 12, 2020, 03:05:15 PMJJ, can you please see if i ported it correctly this time?

I got 30 times "error A2071:initializer magnitude too large for specified size"

Then I switched to UAsm64 (see OPT_Assembler at the end), and everything was ok

@Nidud: with AsmC /Znk I get that error A2071

Congrats, Guga now it is definitely very fast - what is so different now?

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2669    cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
10020   cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
20399   cycles for 100 * pow (CRT, 2.7182818^5)

Thanks for the offer to use your code. I will do so if I need a really fast exponential function. For the library, I hesitate, because I am already fighting with the bloat ;-)

Code Select

1788    bytes for Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
29      bytes for ExpXY (MasmBasic, 2.7182818^5)
39      bytes for pow (CRT, 2.7182818^5)

guga · August 12, 2020, 07:15:08 PM

Quote from: jj2007 on August 12, 2020, 06:09:18 PM
Quote from: guga on August 12, 2020, 03:05:15 PMJJ, can you please see if i ported it correctly this time?

I got 30 times "error A2071:initializer magnitude too large for specified size"

Then I switched to UAsm64 (see OPT_Assembler at the end), and everything was ok

@Nidud: with AsmC /Znk I get that error A2071

Congrats, Guga now it is definitely very fast - what is so different now?

Code Select Expand
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4) 2669 cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5) 10020 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5) 20399 cycles for 100 * pow (CRT, 2.7182818^5)

Thanks for the offer to use your code. I will do so if I need a really fast exponential function. For the library, I hesitate, because I am already fighting with the bloat ;-)

Code Select Expand
1788 bytes for Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5) 29 bytes for ExpXY (MasmBasic, 2.7182818^5) 39 bytes for pow (CRT, 2.7182818^5)

I simply updated masm basic and used the same routine you did on another post for dedndave (TimingsPow.asc). I assembled it with masm and masmbasic.

About the difference of speed, i didn´t changed the previous code, i simply used the correct base and set the parameters according. This time i made a test using Real8 values because the other functions were also being tested with Real 8.

The parameters for the function was writen as:

   push 4   ; The value "4" we use to compute Real8 as i explained on the previous post.
   push offset MyExpo ; Pointer to Real8 or to a Qword only when we are using 64 bits values. For 32 Bits values (dword, Float etc), we use the value directly because int or Float are not pointers. (So, without the "offset" thing in masm)
   call Sse2_exp_precise

The algorithm works only for the exponential of a number (not general power of a number). So, i mainly used the Euler number as a base like this:

MyBaseA   REAL8 2.71828182845904523536028747135
MyBaseB   REAL8 2.71828182845904523536028747135
MyBaseC   REAL8 2.71828182845904523536028747135

https://en.wikipedia.org/wiki/Euler%27s_formula

Your others functions on your example were calculating "pow", so i mainly applied all of them, to use the same base as my algo to compare the results.

I´ll later post other function that calculates a general pow of a number that seems to be as fast as this one and we can then test the results

I´m pretty sure there´s more room for optimization and also avoiding it to use all xmm register. I would prefer the algo was using only xmm0, xmm1 or xmm2, but i couldn´t suceed to limit the numbers of SSE registers being used yet.

The function was originally inside ucrtbase,dll in windows10 . Its a undocumented function, but the original code is a true mess and a bit slower then mine. So, i adapted it to work as expected, but instead using only xmm0 as input, i enabled the possibility to the user inputs whatever number format he wants (integer, floats, doubles etc) and also returns in eax error cases. MY main concern is about precision.The algo has a better precision then other libraries, but i don´t know in how many digits yet it starts loosing precision.

Currently, i´m inserting the function to be used as dll and making a small guide for it and allows teh user to also export the result on whatever way he wants (If it wouldn´t affect the general speed, of course)

Btw...if you want to play and have fun with speed, you may give a try on the other piece of code i provided as an example in the 1st post. Just use xmm0 as input and Voilà

. This piece of code below is speed as possible (even faster then mine), but it completely lacks precision.

Code Select


M_LN2_D Real8 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184,
                Real8 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184 ; 0.5/(ln(2))
SSE_FastExp_Var1 Real8 3070.95411, Real8 3070.95411 ; 3 * 1024.0 - 1.05 

"Real8" is the proper masm syntax to define a Double (64 Bits) floating values, right ???

(...)
    movups xmm2, X$M_LN2_D ; X$ is Rosasm syntax for xmmword.
    movups xmm1, X$SSE_FastExp_Var1
    mulpd xmm0, xmm2 ; The input number must be in xmm0
    movups xmm2, xmm1
    addpd xmm2, xmm0
    psllq XMM2, 11

    ; now we subtract
    subpd xmm1, xmm0
    psllq XMM1, 11

    divpd xmm2, xmm1
    movups xmm0, xmm2

jj2007 · August 12, 2020, 07:56:23 PM

Very nice work, Guga

I've put together a little testbed for examining the pattern of the exponential function, see attachment. My suspicion is that it should be possible to calculate the mantissa from a table, and to apply the exponent afterwards. Maybe following the Fast Sinus() and Cosinus() algos path

But math is my weak point. What do you see there?

guga · August 12, 2020, 08:45:29 PM

Quote from: jj2007 on August 12, 2020, 07:56:23 PM
Very nice work, Guga

I've put together a little testbed for examining the pattern of the exponential function, see attachment. My suspicion is that it should be possible to calculate the mantissa from a table, and to apply the exponent afterwards. Maybe following the Fast Sinus() and Cosinus() algos path

But math is my weak point. What do you see there?

Hi JJ

I see this:

Btw..how you managed to create this curve window ? I like that a lot

About the fastsin, cos algos in the post, i´ll take a look today. Calculating the mantissa from a table, should indeed be the way to go. But, i have no ide how to do it. My knowledge in math is also a bit limited. I´m not used to those complex math symbols anymore. Some i have to push from memory when i tried the engineering university when i was a young (Before i entered to law school, i tried engineering, but i quit. Did mainly 3 years or something.), but...it´s hard because i´m talking of something i was studying more then 30 years ago

Btw...The precision starts loosing after the 6th or 7th digit i suppose. In Rosasm i can see untill 14 digits so it´s easier to check when they are different, but on your test, it have less digits to compare

The MASM Forum

News:

Fast Exp approximation

guga

hutch--

guga

hutch--

guga

jj2007

guga

jj2007

guga

jj2007

guga

jj2007

guga

jj2007

guga