A fast and precise exponent function using SSE2 :azn: :azn: :azn:
In my AMD it took only 31 to 32 clocks in average. I´m quite sure it still have some more room to optimize :mrgreen: :mrgreen:
[SSE_Cv4: R$ 92.3324826168936567, R$ 92.3324826168936567]
; Int64ToDoubleMagic
[SSE_Shifter4: R$ 6.755399441055744e+15, R$ 6.755399441055744e+15,
R$ 8.67361737988403547e-19, R$ 8.67361737988403547e-19]
[SSE_ExpVar1: R$ 1.08304246962234174e-2, R$ 1.08304246962234174e-2]
[SSE_ExpVar2: R$ 2.5728046223276691e-14, R$ 2.5728046223276691e-14]
[SSE_ExpVar3: R$ 8.33216827061673306e-3, R$ 1.66666666698150945e-1]
[SSE_ExpVar4: R$ 4.16667208687284685e-2, R$ 4.99999999999566291e-1]
[ExpTable: R$ 0, 0, 2.04582073601169570e-16, 2.42294657314453310e-310
R$ 4.99974487227263260e-17, 4.87227730461210550e-310, 7.35784687124741820e-18, 7.34827949904982370e-310
R$ 8.18931763819551360e-17, 9.85124358964738520e-310, 1.66658814423267470e-18, 1.23814631722046340e-309
R$ 1.34046152845389320e-16, 1.49392350395699130e-309, 1.44126483728110840e-16, 1.75248592164539530e-309
R$ 1.75676685628521850e-16, 2.01386389946222700e-309, 4.77695942525622330e-17, 2.27808809684712520e-309
R$ 9.34171060990504440e-17, 2.54518950709912420e-309, 4.58567032666235110e-17, 2.81519946101217220e-309
R$ 7.82657325863607620e-17, 3.08814963055018380e-309, 2.82378442595106130e-17, 3.36407203256215700e-309
R$ 3.29047266460084160e-17, 3.64299903253770950e-309, 4.72136812117012750e-17, 3.92496334840353820e-309
R$ 3.34846233362515240e-17, 4.20999805436120270e-309, 5.52755004850524930e-17, 4.49813658476670150e-309
R$ 1.19250041407656740e-16, 4.78941273805230130e-309, 1.65291010027318310e-16, 5.08386068069106740e-309
R$ 3.75085420130312720e-17, 5.38151495120456640e-309, 1.71528210782905300e-16, 5.68241046421419000e-309
R$ 2.10230496752157180e-18, 5.98658251453665700e-309, 1.33575100888345410e-17, 6.29406678132401250e-309
R$ 1.95725852931120360e-17, 6.60489933224881300e-309, 1.14594728088946900e-16, 6.91911662773481080e-309
R$ 1.45979432294737540e-16, 7.23675552523374920e-309, 6.66380458923219520e-17, 7.55785328354870360e-309
R$ 5.68648095791174000e-17, 7.88244756720450720e-309, 7.00787504690699450e-17, 8.21057645086578490e-309
R$ 1.11524233151705370e-16, 8.54227842380307460e-309, 8.99959272503798580e-17, 8.87759239440759380e-309
R$ 8.86511592917582850e-17, 9.21655769475515650e-309, 1.46901932708742340e-16, 9.55921408521980240e-309
R$ 1.32721817677819360e-16, 9.90560175913766330e-309, 1.13655151388858840e-16, 1.02557613475215990e-308
R$ 1.26761474172717220e-16, 1.06097339238271920e-308, 9.50689710108795960e-18, 1.09675610087706390e-308
R$ 7.97786311075428990e-17, 1.13292845751990740e-308, 7.32795774215157790e-17, 1.16949470530139930e-308
R$ 5.15483011707867830e-17, 1.20645913341482120e-308, 2.42539857666898060e-17, 1.24382607775970770e-308
R$ 1.34460827434261430e-16, 1.28159992145044320e-308, 7.60136434780041980e-17, 1.31978509533039990e-308
R$ 1.53414100536037230e-17, 1.35838607849166940e-308, 9.51550643423588250e-17, 1.39740739880045840e-308
R$ 7.33935310510226070e-17, 1.43685363342820510e-308, 3.54094826264618290e-17, 1.47672940938847590e-308
R$ 4.87516052622706170e-17, 1.51703940407971140e-308, 1.25886762958400070e-16, 1.55778834583388210e-308
R$ 1.18426926175418470e-16, 1.59898101447111830e-308, 1.82140544036225900e-17, 1.64062224186037870e-308
R$ 1.68548729062897310e-17, 1.68271691248622430e-308, 3.62161593533689420e-17, 1.72526996402176670e-308
R$ 1.01562190116415010e-17, 1.76828638790785260e-308, 6.74378629321924300e-17, 1.81177122993855490e-308
R$ 1.79012690760451310e-17, 1.85572959085304300e-308, 5.26537076855627410e-17, 1.90016662693389250e-308
R$ 8.58071433248992810e-17, 1.94508755061191610e-308, 1.79639326598330220e-17, 1.99049763107757670e-308
R$ 6.04870235399373270e-17, 2.03640219489905820e-308, 5.33680587851415070e-17, 2.08280662664707050e-308
R$ 4.57849152770600950e-17, 2.12971636952645360e-308, 2.04142788975783030e-17, 2.17713692601466200e-308]
[SSE_Exp_MMask0: Q$ SSE_EXPONENT_MMASK, SSE_EXPONENT_MMASK]
[SSE_Exp_Bias0: Q$ SSE_EXPONENT_BIAS, SSE_EXPONENT_BIAS]
[SSE_OneVal0: R$ 1.0]
[SSE_Exp_EMask0: Q$ SSE_EXPONENT_EMASK, SSE_EXPONENT_EMASK]
[SSe_Exp_RetVal: Q$ 0]
[SSE_TmpVar2: Q$ 0]
; Equates used
[SSE_EXPONENT_BIAS 0FFC0] ; 2^16-2^6 = 2^16-64 = 65472
[SSE_EXPONENT_MMASK 0FFFFFFC0] ; 2^32-2^6 = 4294967232 (in int)
[SSE_EXPONENT_EMASK 0FFF00000] ; 4293918720 (in int)
; Parameters flag
[SSE_EXP_INT 1
SSE_EXP_FLOAT 2
SSE_EXP_REAL8 4
SSE_EXP_QWORD 8]
; Values to return
[SSE_EXP_INVALID_PARAMETER 0-1] ; Invalid flag
[SSE_UNDERFLOW 0-1] ; The inputed number is underflow
[SSE_OVERFLOWN 0-2] ; The inputed number is overflow
[SSE_INFINITE 0-3] ; General error. The inputed number is infinite, or NAN etc
;;
SSE2_Exp_precise
This function calculates the exponential of a number using approximations on a higher level of accuracy.
Parameters:
Number (in) - The inputed value to be found. Your an insert any type of number to be calculated. Negative, positive,
Float, Integer, Double, Qword etc.
This parameter can be used as a Int or a Pointer to the value depending of the used flag.
Flag. A set of flags to be used accordying to the type of input.
Equate Name Value Description
SSE_EXP_INT 1 The user inputed a Int value to be calculated.
SSE_EXP_FLOAT 2 The inputed value is a Float (Real4) data.
SSE_EXP_REAL8 3 The inputed value is a Pointer to a Real8 data.
SSE_EXP_QWORD 4 The inputed value is a Pointer to a QWord data.
Return Value:
If the function suceeds, it returns TRUE and the resultant value is stored as a Double (Real8) in the low quadword of xmm0 register
If the function fails it returns a set of error cases described in the equates below:
[SSE_EXP_INVALID_PARAMETER 0] ; Invalid flag specified.
[SSE_UNDERFLOW 0-1] ; The inputed number is underflow
[SSE_OVERFLOWN 0-2] ; The inputed number is overflow
[SSE_INFINITE 0-3] ; General error. The inputed number is infinite, or NAN etc
Example of usage:
1 - The inputed value (-5) is a Signed Integer
call Sse2_exp_precise 0-5, SSE_EXP_INT
2 - The inputed value (-5) is a Float (Real4)
[MyNumber: F$ -5.0]
call Sse2_exp_precise D$MyNumber, SSE_EXP_FLOAT
3 - The inputed value (5) is a Real8
[MyNumber: R$ 5.0]
call Sse2_exp_precise MyNumber, SSE_EXP_REAL8
4 - The inputed value (5) is a Qword (int64)
[MyNumber: Q$ 5.0]
call Sse2_exp_precise MyNumber, SSE_EXP_QWORD
;;
Proc Sse2_exp_precise:
Arguments @Number, @Flag
Local @ControlWord
Uses edx, ecx
mov D@ControlWord 0
mov eax D@Flag
Test_if eax SSE_EXP_INT
cvtsi2sd xmm0 D@Number ; converts a signed integer to double
Test_Else_if eax SSE_EXP_FLOAT
cvtss2sd xmm0 X@Number ; converts a single precision float to double
Test_Else_if eax SSE_EXP_REAL8
mov eax D@Number | movsd XMM0 X$eax
Test_Else_if eax SSE_EXP_QWORD
mov eax D@Number | movq XMM0 X$eax
Test_Else
xor eax eax | ExitP ; return 0 Invalid parameter
Test_End
unpcklpd XMM0 XMM0 | pextrw eax XMM0 3 | and eax 07FFF
mov edx 0408F | sub edx eax
sub eax 03C90 | or edx eax
...If edx < 080000000 ; Number is not zero
movupd XMM1 X$SSE_Cv4
movupd XMM6 X$SSE_Shifter4
movupd XMM2 X$SSE_ExpVar1
movupd XMM3 X$SSE_ExpVar2
mulpd XMM1 XMM0 | addpd XMM1 XMM6 | movupd XMM7 XMM1
subpd XMM1 XMM6 | mulpd XMM2 XMM1
mulpd XMM3 XMM1
movupd XMM5 X$SSE_ExpVar4
subpd XMM0 XMM2
movd eax XMM7 | mov ecx eax | and ecx 03F | shl ecx 04 | sar eax 06 | mov edx eax
subpd XMM0 XMM3
movupd XMM2 X$ecx+ExpTable
movupd XMM4 X$SSE_ExpVar3
mulpd XMM4 XMM0
movupd XMM1 XMM0
mulpd XMM0 XMM0
addpd XMM5 XMM4
mulsd XMM0 XMM0
addsd XMM1 XMM2
unpckhpd XMM2 XMM2
movdqu XMM6 X$SSE_Exp_MMask0 | pand XMM7 XMM6
movdqu XMM6 X$SSE_Exp_Bias0 | paddq XMM7 XMM6 | psllq XMM7 46
mulpd XMM0 XMM5
addsd XMM1 XMM0
orpd XMM2 XMM7
unpckhpd XMM0 XMM0
addsd XMM0 XMM1
add edx 037E
..If edx <= 077C ; Adjust 0
mulsd XMM0 XMM2
addsd XMM0 XMM2
mov eax &TRUE
..Else
fstcw W@ControlWord
mov dx W@ControlWord | or dx 0300 | mov W$SSe_Exp_RetVal dx | fldcw W$SSe_Exp_RetVal
mov edx eax | sar eax 1 | sub edx eax
movdqu XMM6 X$SSE_Exp_EMask0
pandn XMM6 XMM2
add eax 03FF | movd XMM3 ax | psllq XMM3 52
orpd XMM6 XMM3
add edx 03FF | movd XMM4 dx | psllq XMM4 52
movlpd X$SSe_Exp_RetVal XMM0 | fld R$SSe_Exp_RetVal
movlpd X$SSE_TmpVar2 XMM0 | fld R$SSE_TmpVar2
fmul ST1 ST0
faddp ST1 ST0
movlpd X$SSe_Exp_RetVal XMM4 | fld R$SSe_Exp_RetVal
fmulp ST1 ST0
fstp R$SSe_Exp_RetVal
movlps XMM0 X$SSe_Exp_RetVal
fldcw W@ControlWord
pextrw ecx XMM0 3 | and ecx 07FF0
If ecx >= 07FF0 ; Overflown
;xor eax eax
mov eax SSE_OVERFLOWN
Else_If ecx = 0 ; Found underflow
;xor eax eax
mov eax SSE_UNDERFLOW
Else
mov eax &TRUE
End_If
..End_If
...Else
; Number seems to be zero. Let´s confirm.
mov eax D@Number | and eax 07FFFFFFF
If eax < 040900000 ; Input is zero. Return 1
addsd XMM0 X$SSE_OneVal0
mov eax &TRUE
Else
; Error. The number can be a underflow or overflow, NAN, INF etc
mov eax 0-1 ; Underflow_Overflow
End_If
...End_If
EndP
If someone can please test to benchmark it would be appreciated :) Attached the masm version ;)
Btw...below is another version (a inaccurate one :rolleyes: :rolleyes: :rolleyes:) that takes only 14 clocks :mrgreen: :mrgreen: :mrgreen: :mrgreen:
[M_LN2_D: R$ 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184,
R$ 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184] ; 0.5/(ln(2))
[SSE_FastExp_Var1: R$ 3070.95411, R$ 3070.95411]; 3 * 1024.0 - 1.05
(...)
The inputed value must be in xmm0 and..thats it. Only 10 instructions :bgrin: :bgrin: :bgrin:
movups xmm2 X$M_LN2_D
movups xmm1 X$SSE_FastExp_Var1 | mulpd xmm0 xmm2
movups xmm2 xmm1 | addpd xmm2 xmm0
psllq XMM2 11
; now we subtract
subpd xmm1 xmm0
psllq XMM1 11
divpd xmm2 xmm1
movups xmm0 xmm2
This code is adapted from
https://stackoverflow.com/questions/53872265/how-do-you-process-exp-with-sse2
basically, it does only this:
shl 11 ((Number*(0.5/(ln(2))+(3 * 1024.0 - 1.05)))
______________________________________________
shl 11 ((Number*(0.5/(ln(2))-(3 * 1024.0 - 1.05)))
Updated. Attached JJ benchmark test "GugatestExponent.zip"
Guga,
You need to build it so anyone can test it. I downloaded in but its a source of unknown assembler type.
Unknown ?
That´s weird. I thought it was compatible to masm. I´ll ty to port it. :thumbsup:
Just build it so we can test it for you.
Ok. i guess now it will work.
I just see a box saying 3D frames :sad:
try now.
It is only the function code i made. Not a working app. The returned value is stored in the 1st double in xmm0
; #########################################################################
;
; The framing affects are drawn on the client area by a single procedure,
; "Frame3D". In the WmdProc message handling Proc, the WM_PAINT message
; calls another Proc called "Paint_Proc" which contains the procedure calls
; to "Frame3D".
;
; #########################################################################
.686p
.model flat,stdcall
.mmx
.xmm
OPTION CASEMAP:NONE
; #########################################################################
include \masm32\include\windows.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
include \masm32\include\gdi32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\gdi32.lib
; #########################################################################
;=============
; Local macros
;=============
szText MACRO Name, Text:VARARG
LOCAL lbl
jmp lbl
Name db Text,0
lbl:
ENDM
m2m MACRO M1, M2
push M2
pop M1
ENDM
return MACRO arg
mov eax, arg
ret
ENDM
;=================
; Local prototypes
;=================
; Sse2_exp_precise PROTO :DWORD,:DWORD
.data?
XMMWORD2 struc
data dq 2 dup (?)
XMMWORD2 ends
.data
szDisplayName db "3D Frames",0
CommandLine dd 0
hWnd dd 0
hInstance dd 0
SSE_Cv4 dq 92.3324826168936567, 92.3324826168936567
; Int64ToDoubleMagic
SSE_Shifter4 dq 6.755399441055744e+15, 6.755399441055744e+15,
8.67361737988403547e-19, 8.67361737988403547e-19
SSE_ExpVar1 dq 1.08304246962234174e-2, 1.08304246962234174e-2
SSE_ExpVar2 dq 2.5728046223276691e-14, 2.5728046223276691e-14
SSE_ExpVar3 dq 8.33216827061673306e-3, 1.66666666698150945e-1
SSE_ExpVar4 dq 4.16667208687284685e-2, 4.99999999999566291e-1
ExpTable dq 0, 0, 2.04582073601169570e-16, 2.42294657314453310e-310
dq 4.99974487227263260e-17, 4.87227730461210550e-310, 7.35784687124741820e-18, 7.34827949904982370e-310
dq 8.18931763819551360e-17, 9.85124358964738520e-310, 1.66658814423267470e-18, 1.23814631722046340e-309
dq 1.34046152845389320e-16, 1.49392350395699130e-309, 1.44126483728110840e-16, 1.75248592164539530e-309
dq 1.75676685628521850e-16, 2.01386389946222700e-309, 4.77695942525622330e-17, 2.27808809684712520e-309
dq 9.34171060990504440e-17, 2.54518950709912420e-309, 4.58567032666235110e-17, 2.81519946101217220e-309
dq 7.82657325863607620e-17, 3.08814963055018380e-309, 2.82378442595106130e-17, 3.36407203256215700e-309
dq 3.29047266460084160e-17, 3.64299903253770950e-309, 4.72136812117012750e-17, 3.92496334840353820e-309
dq 3.34846233362515240e-17, 4.20999805436120270e-309, 5.52755004850524930e-17, 4.49813658476670150e-309
dq 1.19250041407656740e-16, 4.78941273805230130e-309, 1.65291010027318310e-16, 5.08386068069106740e-309
dq 3.75085420130312720e-17, 5.38151495120456640e-309, 1.71528210782905300e-16, 5.68241046421419000e-309
dq 2.10230496752157180e-18, 5.98658251453665700e-309, 1.33575100888345410e-17, 6.29406678132401250e-309
dq 1.95725852931120360e-17, 6.60489933224881300e-309, 1.14594728088946900e-16, 6.91911662773481080e-309
dq 1.45979432294737540e-16, 7.23675552523374920e-309, 6.66380458923219520e-17, 7.55785328354870360e-309
dq 5.68648095791174000e-17, 7.88244756720450720e-309, 7.00787504690699450e-17, 8.21057645086578490e-309
dq 1.11524233151705370e-16, 8.54227842380307460e-309, 8.99959272503798580e-17, 8.87759239440759380e-309
dq 8.86511592917582850e-17, 9.21655769475515650e-309, 1.46901932708742340e-16, 9.55921408521980240e-309
dq 1.32721817677819360e-16, 9.90560175913766330e-309, 1.13655151388858840e-16, 1.02557613475215990e-308
dq 1.26761474172717220e-16, 1.06097339238271920e-308, 9.50689710108795960e-18, 1.09675610087706390e-308
dq 7.97786311075428990e-17, 1.13292845751990740e-308, 7.32795774215157790e-17, 1.16949470530139930e-308
dq 5.15483011707867830e-17, 1.20645913341482120e-308, 2.42539857666898060e-17, 1.24382607775970770e-308
dq 1.34460827434261430e-16, 1.28159992145044320e-308, 7.60136434780041980e-17, 1.31978509533039990e-308
dq 1.53414100536037230e-17, 1.35838607849166940e-308, 9.51550643423588250e-17, 1.39740739880045840e-308
dq 7.33935310510226070e-17, 1.43685363342820510e-308, 3.54094826264618290e-17, 1.47672940938847590e-308
dq 4.87516052622706170e-17, 1.51703940407971140e-308, 1.25886762958400070e-16, 1.55778834583388210e-308
dq 1.18426926175418470e-16, 1.59898101447111830e-308, 1.82140544036225900e-17, 1.64062224186037870e-308
dq 1.68548729062897310e-17, 1.68271691248622430e-308, 3.62161593533689420e-17, 1.72526996402176670e-308
dq 1.01562190116415010e-17, 1.76828638790785260e-308, 6.74378629321924300e-17, 1.81177122993855490e-308
dq 1.79012690760451310e-17, 1.85572959085304300e-308, 5.26537076855627410e-17, 1.90016662693389250e-308
dq 8.58071433248992810e-17, 1.94508755061191610e-308, 1.79639326598330220e-17, 1.99049763107757670e-308
dq 6.04870235399373270e-17, 2.03640219489905820e-308, 5.33680587851415070e-17, 2.08280662664707050e-308
dq 4.57849152770600950e-17, 2.12971636952645360e-308, 2.04142788975783030e-17, 2.17713692601466200e-308
SSE_Exp_MMask0 dd 0FFFFFFC0h, 0, 0FFFFFFC0h, 0
SSE_Exp_Bias0 dq 0FFC0h, 0FFC0h
SSE_OneVal0 dq 1.0
SSE_Exp_EMask0 dq 0FFF00000h, 0FFF00000h
SSe_Exp_RetVal dq 0
SSE_TmpVar2 dq 0
.code
start:
push 1
push 5
call Sse2_exp_precise
invoke ExitProcess, 0
; #########################################################################
Sse2_exp_precise proc
ControlWord = dword ptr -4
Number = dword ptr 8
Flag = dword ptr 0Ch
push ebp
mov ebp, esp
sub esp, 4
push edx
push ecx
mov [ebp+ControlWord], 0
mov eax, [ebp+Flag]
test eax, 1
jz short loc_4CE8D0
cvtsi2sd xmm0, [ebp+Number]
jmp short loc_4CE905
; ---------------------------------------------------------------------------
loc_4CE8D0: ; CODE XREF: Sse2_exp_precise+17?j
test eax, 2
jz short loc_4CE8DE
cvtss2sd xmm0, [ebp+Number]
jmp short loc_4CE905
; ---------------------------------------------------------------------------
loc_4CE8DE: ; CODE XREF: Sse2_exp_precise+25?j
test eax, 4
jz short loc_4CE8EE
mov eax, [ebp+Number]
movsd xmm0, qword ptr [eax]
jmp short loc_4CE905
; ---------------------------------------------------------------------------
loc_4CE8EE: ; CODE XREF: Sse2_exp_precise+33?j
test eax, 8
jz short loc_4CE8FE
mov eax, [ebp+Number]
movq xmm0, qword ptr [eax]
jmp short loc_4CE905
; ---------------------------------------------------------------------------
loc_4CE8FE: ; CODE XREF: Sse2_exp_precise+43?j
xor eax, eax
jmp loc_4CEAE7
; ---------------------------------------------------------------------------
loc_4CE905: ; CODE XREF: Sse2_exp_precise+1E?j
; Sse2_exp_precise+2C?j ...
unpcklpd xmm0, xmm0
pextrw eax, xmm0, 3
and eax, 7FFFh
mov edx, 408Fh
sub edx, eax
sub eax, 3C90h
or edx, eax
cmp edx, 80000000h
jnb loc_4CEAC4
movupd xmm1, SSE_Cv4
movupd xmm6, SSE_Shifter4
movupd xmm2, SSE_ExpVar1
movupd xmm3, SSE_ExpVar2
mulpd xmm1, xmm0
addpd xmm1, xmm6
movupd xmm7, xmm1
subpd xmm1, xmm6
mulpd xmm2, xmm1
mulpd xmm3, xmm1
movupd xmm5, SSE_ExpVar4
subpd xmm0, xmm2
movd eax, xmm7
mov ecx, eax
and ecx, 3Fh
shl ecx, 4
sar eax, 6
mov edx, eax
subpd xmm0, xmm3
movupd xmm2, ExpTable[ecx]
movupd xmm4, SSE_ExpVar3
mulpd xmm4, xmm0
movupd xmm1, xmm0
mulpd xmm0, xmm0
addpd xmm5, xmm4
mulsd xmm0, xmm0
addsd xmm1, xmm2
unpckhpd xmm2, xmm2
movdqu xmm6, xmmword ptr SSE_Exp_MMask0
pand xmm7, xmm6
movdqu xmm6, xmmword ptr SSE_Exp_Bias0
paddq xmm7, xmm6
psllq xmm7, 2Eh
mulpd xmm0, xmm5
addsd xmm1, xmm0
orpd xmm2, xmm7
unpckhpd xmm0, xmm0
addsd xmm0, xmm1
add edx, 37Eh
cmp edx, 77Ch
ja loc_4CEA07
mulsd xmm0, xmm2
addsd xmm0, xmm2
mov eax, 1
jmp loc_4CEABF
; ---------------------------------------------------------------------------
loc_4CEA07: ; CODE XREF: Sse2_exp_precise+13F?j
fstcw word ptr [ebp+ControlWord]
mov dx, word ptr [ebp+ControlWord]
or dx, 300h
mov word ptr SSe_Exp_RetVal, dx
fldcw word ptr SSe_Exp_RetVal
mov edx, eax
sar eax, 1
sub edx, eax
movdqu xmm6, xmmword ptr SSE_Exp_EMask0
pandn xmm6, xmm2
add eax, 3FFh
movd xmm3, eax
psllq xmm3, 34h
orpd xmm6, xmm3
add edx, 3FFh
movd xmm4, edx
psllq xmm4, 34h
movlpd SSe_Exp_RetVal, xmm0
fld SSe_Exp_RetVal
movlpd SSE_TmpVar2, xmm0
fld SSE_TmpVar2
fmul st(1), st
faddp st(1), st
movlpd SSe_Exp_RetVal, xmm4
fld SSe_Exp_RetVal
fmulp st(1), st
fstp SSe_Exp_RetVal
movlps xmm0, SSe_Exp_RetVal
fldcw word ptr [ebp+ControlWord]
pextrw ecx, xmm0, 3
and ecx, 7FF0h
cmp ecx, 7FF0h
jb short loc_4CEAAE
mov eax, 0FFFFFFFEh
jmp short loc_4CEABF
; ---------------------------------------------------------------------------
loc_4CEAAE: ; CODE XREF: Sse2_exp_precise+1F5?j
cmp ecx, 0
jnz short loc_4CEABA
mov eax, 0FFFFFFFFh
jmp short loc_4CEABF
; ---------------------------------------------------------------------------
loc_4CEABA: ; CODE XREF: Sse2_exp_precise+201?j
mov eax, 1
loc_4CEABF: ; CODE XREF: Sse2_exp_precise+152?j
; Sse2_exp_precise+1FC?j ...
jmp loc_4CEAE7
; ---------------------------------------------------------------------------
loc_4CEAC4: ; CODE XREF: Sse2_exp_precise+77?j
mov eax, [ebp+Number]
and eax, 7FFFFFFFh
cmp eax, 40900000h
jnb short loc_4CEAE2
addsd xmm0, SSE_OneVal0
mov eax, 1
jmp short loc_4CEAE7
; ---------------------------------------------------------------------------
loc_4CEAE2: ; CODE XREF: Sse2_exp_precise+221?j
mov eax, 0FFFFFFFFh
loc_4CEAE7: ; CODE XREF: Sse2_exp_precise+50?j
; Sse2_exp_precise:loc_4CEABF?j ...
pop ecx
pop edx
mov esp, ebp
pop ebp
retn 8
Sse2_exp_precise endp
; #########################################################################
end start
It looks like you used the asm output of a compiler. What does...
push 1
push 5
call Sse2_exp_precise
... perform? What are the parameters? What is the expected output? I get 148.4131591 in the low double of xmm0, does that make sense, and how?
What exactly is an "exp function"? I consider this (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1444)an "exp function":
include \masm32\MasmBasic\MasmBasic.inc
MyDouble double ?
Init
ExpXY(4, 0.5, MyDouble)
Inkey Str$("4^0.5=%9f", MyDouble) ; Output: 4^0.5=2.00000000
EndOfCode
P.S., to make your source assemble with UAsm:
movupd xmm6, OWORD ptr SSE_Shifter4
movupd xmm2, OWORD ptr ExpTable[ecx]
Hi JJ
Yes, when you input 5 as an integer value, the result is 148.4131591....
The parameters are:
1st Parameter = The value to be calculated. It can be a integer, a Float or a address to Real8 (or qword)
2nd Parameter - The flag to be used on each type of input. I build a set of equates for that.
SSE_EXP_INT equ 1
SSE_EXP_FLOAT equ 2
SSE_EXP_REAL8 equ 4
SSE_EXP_QWORD equ 8
So, if the input is a integer you must use
invoke Sse2_exp_precise, 5, 1
or
invoke Sse2_exp_precise, 5, SSE_EXP_INT
If the input if a Float (Real4) you use:
MyValue Real4 5.0 ; I believe this is the masm syntax for Float, right ?
invoke Sse2_exp_precise, [MyValue], 2
or
invoke Sse2_exp_precise, [MyValue], SSE_EXP_FLOAT
If the input is a Real8, you must use as a input the address (Offset) of the Value, like this:
MyValue Real8 5.0
invoke Sse2_exp_precise, offset MyValue, 4
or
MyValue Real8 5.0
invoke Sse2_exp_precise, offset MyValue, SSE_EXP_REAL8
And, the same for a Qword (int64). I don´t remember what is the syntax for a Qword in masm. But it is for integers in 64 bits.
It works for either positive or negative values.
Exp is the exponencial function.
exp(5) = 148.4131591025766034211155800405522796234876675938789890467
on mine version it is close (If not, equal)
exp(5) = 148.4131591...
I was trying to make it work with masmbasic, but i don´t know how to setup the output for ir display the result 148.41....
I tried this:
push esi
push loops-1
NanoTimer()
push 1
push 5
call Sse2_exp_precise
; cvtsd2ss eax, xmm0 ; convert to int
cvtsd2ss xmm0, xmm0
; cvtsd2si eax, xmm0
; fstp tmp8
Print Str$("## SSe_2_exp_Precise result Exp(5) = %f", eax), Str$("<>%i ##\n\n", esi)
But...got nowehere. I only got the amount of time the function took to work. (Attached file). It assembled on masmbasic and worked, but i don´t know how to set a variable in masm basic to see the result.
Here it is - your source, formatted and ready to launch, Str$(f:xmm0) did the job ;-)
REPEAT 3
NanoTimer()
push loops-1
.Repeat
push 1 ;
push 5
call Sse2_exp_precise ; invoke Sse2_exp_precise, 5, 1
dec stack
.Until Sign?
pop edx
Print Str$("%i Mio Sse2_exp_precise iterations took ", loops/1000000), NanoTimer$(), Str$(" - result Exp(5) = %9f\n", f:xmm0)
NanoTimer()
push loops-1
.Repeat
ExpE(5, tmp8)
dec stack
.Until Sign?
pop edx
Print Str$("%i Mio MasmBasic ExpE() iterations took ", loops/1000000), NanoTimer$(), Str$(" - result Exp(5) = %9f\n\n", tmp8)
ENDM
You made me discover a bug in MasmBasic, therefore it will build correctly only with MB version 12 August 2020 (http://masm32.com/board/index.php?topic=94.0) :tongue:
Actually, it was not really a bug but...
MbExp PROTO
if 0
ffree st(7) ; MbExp[-12]
fldl2e
jmp @F
Mb10X: ffree st(7) ; MbExp[-6]
fldl2t
@@: fmul
endif
align 4
MbExp proc
sub esp, 3*DWORD
r10 equ stack
mov eax, 3fffh ; credits to bitRake (http://www.asmcommunity.net/forums/topic/?id=2979#post-30586), qWord & Agner Fog (http://masm32.com/board/index.php?topic=1783.msg18292#msg18292)
xor edx, edx
What happened here is that a few years ago, when looking at this, I didn't quite understand what this code did (April 2013 - oooold (http://masm32.com/board/index.php?topic=94.msg18468#msg18468)), so I commented it out with the if 0, and from then on a call MbExp[-12] ended up in No Man's Land. A clear case of not enough comments :cool:
Timmings Result
JJ, can you please see if i ported it correctly this time? I was thinking a bit weird the result be that slow and gave a try with another timming example you made using Dave routine called "f8Exp". I replaced his version with mine to test the speed of the algo
Can you please see if the porting is ok, and also try to force the result to export from xmm0 to MyReal8 to we also check the resultant value and the timmings ?
Here, the timmings are now Ok. They fits to the results i´ve got earlier.
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
2076 cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11193 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25992 cycles for 100 * pow (CRT, 2.7182818^5)
2024 cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11148 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
29457 cycles for 100 * pow (CRT, 2.7182818^5)
2039 cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11639 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25994 cycles for 100 * pow (CRT, 2.7182818^5)
2065 cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
11204 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
27472 cycles for 100 * pow (CRT, 2.7182818^5)
2033 cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
12207 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
25956 cycles for 100 * pow (CRT, 2.7182818^5)
148.413159102577 for Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
148.413159102577 for ExpXY (MasmBasic, 2.7182818^5)
148.413159102577 for pow (CRT, 2.7182818^5)
--- ok ---
If the test is ok, you guys can use the function if needed, and also change it to it be more readable using the correct masm macros etc. I finished a similar routine for calculating a fast Logarithm and fast pow of a number.I should post it soon if the tests of this sse2 exp precise are ok. :bgrin:
Quote from: guga on August 12, 2020, 03:05:15 PMJJ, can you please see if i ported it correctly this time?
I got 30 times "error A2071:initializer magnitude too large for specified size" :sad:
Then I switched to UAsm64 (see OPT_Assembler at the end), and everything was ok :biggrin: :thumbsup:
@Nidud: with AsmC /Znk I get that error A2071 :sad:
Congrats, Guga now it is definitely very fast - what is so different now?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2669 cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
10020 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
20399 cycles for 100 * pow (CRT, 2.7182818^5)
Thanks for the offer to use your code. I will do so if I need a really fast exponential function. For the library, I hesitate, because I am already fighting with the bloat ;-)
1788 bytes for Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
29 bytes for ExpXY (MasmBasic, 2.7182818^5)
39 bytes for pow (CRT, 2.7182818^5)
Quote from: jj2007 on August 12, 2020, 06:09:18 PM
Quote from: guga on August 12, 2020, 03:05:15 PMJJ, can you please see if i ported it correctly this time?
I got 30 times "error A2071:initializer magnitude too large for specified size" :sad:
Then I switched to UAsm64 (see OPT_Assembler at the end), and everything was ok :biggrin: :thumbsup:
@Nidud: with AsmC /Znk I get that error A2071 :sad:
Congrats, Guga now it is definitely very fast - what is so different now?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2669 cycles for 100 * Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
10020 cycles for 100 * ExpXY (MasmBasic, 2.7182818^5)
20399 cycles for 100 * pow (CRT, 2.7182818^5)
Thanks for the offer to use your code. I will do so if I need a really fast exponential function. For the library, I hesitate, because I am already fighting with the bloat ;-)
1788 bytes for Sse2_exp_precise (Guga SSE2 Exp precise , 2.7182818^5)
29 bytes for ExpXY (MasmBasic, 2.7182818^5)
39 bytes for pow (CRT, 2.7182818^5)
I simply updated masm basic and used the same routine you did on another post for dedndave (TimingsPow.asc). I assembled it with masm and masmbasic.
About the difference of speed, i didn´t changed the previous code, i simply used the correct base and set the parameters according. This time i made a test using Real8 values because the other functions were also being tested with Real 8.
The parameters for the function was writen as:
push 4 ; The value "4" we use to compute Real8 as i explained on the previous post.
push offset MyExpo ; Pointer to Real8 or to a Qword only when we are using 64 bits values. For 32 Bits values (dword, Float etc), we use the value directly because int or Float are not pointers. (So, without the "offset" thing in masm)
call Sse2_exp_precise
The algorithm works
only for the exponential of a number (not general power of a number). So, i mainly used the
Euler number as a base like this:
MyBaseA REAL8 2.71828182845904523536028747135
MyBaseB REAL8 2.71828182845904523536028747135
MyBaseC REAL8 2.71828182845904523536028747135
https://en.wikipedia.org/wiki/Euler%27s_formula
Your others functions on your example were calculating "pow", so i mainly applied all of them, to use the same base as my algo to compare the results.
I´ll later post other function that calculates a general pow of a number that seems to be as fast as this one and we can then test the results :thumbsup:
I´m pretty sure there´s more room for optimization and also avoiding it to use all xmm register. I would prefer the algo was using only xmm0, xmm1 or xmm2, but i couldn´t suceed to limit the numbers of SSE registers being used yet.
The function was originally inside ucrtbase,dll in windows10 . Its a undocumented function, but the original code is a true mess and a bit slower then mine. So, i adapted it to work as expected, but instead using only xmm0 as input, i enabled the possibility to the user inputs whatever number format he wants (integer, floats, doubles etc) and also returns in eax error cases. MY main concern is about precision.The algo has a better precision then other libraries, but i don´t know in how many digits yet it starts loosing precision.
Currently, i´m inserting the function to be used as dll and making a small guide for it and allows teh user to also export the result on whatever way he wants (If it wouldn´t affect the general speed, of course)
Btw...if you want to play and have fun with speed, you may give a try on the other piece of code i provided as an example in the 1st post. Just use xmm0 as input and Voilà :mrgreen: :mrgreen: :mrgreen:. This piece of code below is speed as possible (even faster then mine),
but it completely lacks precision.
M_LN2_D Real8 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184,
Real8 0.72134752044448170367708185223087490666832985327631055065761515174311334481361108009443447714665950725529579950184 ; 0.5/(ln(2))
SSE_FastExp_Var1 Real8 3070.95411, Real8 3070.95411 ; 3 * 1024.0 - 1.05
"Real8" is the proper masm syntax to define a Double (64 Bits) floating values, right ???
(...)
movups xmm2, X$M_LN2_D ; X$ is Rosasm syntax for xmmword.
movups xmm1, X$SSE_FastExp_Var1
mulpd xmm0, xmm2 ; The input number must be in xmm0
movups xmm2, xmm1
addpd xmm2, xmm0
psllq XMM2, 11
; now we subtract
subpd xmm1, xmm0
psllq XMM1, 11
divpd xmm2, xmm1
movups xmm0, xmm2
Very nice work, Guga :thumbsup:
I've put together a little testbed for examining the pattern of the exponential function, see attachment. My suspicion is that it should be possible to calculate the mantissa from a table, and to apply the exponent afterwards. Maybe following the Fast Sinus() and Cosinus() algos (http://masm32.com/board/index.php?topic=4580.0) path :cool:
But math is my weak point. What do you see there?
Quote from: jj2007 on August 12, 2020, 07:56:23 PM
Very nice work, Guga :thumbsup:
I've put together a little testbed for examining the pattern of the exponential function, see attachment. My suspicion is that it should be possible to calculate the mantissa from a table, and to apply the exponent afterwards. Maybe following the Fast Sinus() and Cosinus() algos (http://masm32.com/board/index.php?topic=4580.0) path :cool:
But math is my weak point. What do you see there?
Hi JJ
I see this:
(https://i.ibb.co/3kSxshY/Image1.png) (https://ibb.co/WsBr3Kp)
Btw..how you managed to create this curve window ? I like that a lot :thumbsup: :thumbsup: :thumbsup: :thumbsup:
About the fastsin, cos algos in the post, i´ll take a look today. Calculating the mantissa from a table, should indeed be the way to go. But, i have no ide how to do it. My knowledge in math is also a bit limited. I´m not used to those complex math symbols anymore. Some i have to push from memory when i tried the engineering university when i was a young (Before i entered to law school, i tried engineering, but i quit. Did mainly 3 years or something.), but...it´s hard because i´m talking of something i was studying more then 30 years ago :greensml: :greensml: :greensml: :greensml: :greensml:
Btw...The precision starts loosing after the 6th or 7th digit i suppose. In Rosasm i can see untill 14 digits so it´s easier to check when they are different, but on your test, it have less digits to compare
Quote from: guga on August 12, 2020, 08:45:29 PMI see this:
I actually meant "what kind of pattern do you see, and how could we use it to fumble mantissa and exponent"
QuoteBtw..how you managed to create this curve window ? I like that a lot :thumbsup: :thumbsup: :thumbsup: :thumbsup:
See ArrayPlot in the help file. All you need is an array:
Event Paint
ArrayPlot MyData(), RgbCol(200, 255, 255, 0), lines=4
ArrayPlot exit, "Exponential function"
Hi Guga,
You said: MY main concern is about precision.
So, a very fast Chebyshev polynomial approximation is out of the question?
For what purpose do you need the Exp function?
Quote from: Siekmanski on August 12, 2020, 09:05:31 PM
Hi Guga,
You said: MY main concern is about precision.
So, a very fast Chebyshev polynomial approximation is out of the question?
For what purpose do you need the Exp function?
Hi Marinus
It depends of the level of precision of the Chebyshev polynomial. I think it´s usefull for what we are doing. Even considering some loss of precision, it seems to me, it won´t affect the final result.
The exp functon i was using to test that W Lambert function i told on the other post. Even if we won´t use it for the watermark r emover, we can use for other puposes in other image filters that we could make. I gave a try on that laplace 2D algorithm that used this exp and pow functions to compute the W function. But, after you explained, i understood better this laplace thing, but i wwant to give a try calculating the sigma as a standard deviation of the whole image and see if we will need this Lambert algoithm or not.
But, even if we wouldn´t need it we can use faster pow and exp to build other algorithms such as a faster (and more accurate) CieLab for example or the other algorithm i tried last year (HSM or something)
Quote from: jj2007 on August 12, 2020, 09:04:50 PM
Quote from: guga on August 12, 2020, 08:45:29 PMI see this:
I actually meant "what kind of pattern do you see, and how could we use it to fumble mantissa and exponent"
QuoteBtw..how you managed to create this curve window ? I like that a lot :thumbsup: :thumbsup: :thumbsup: :thumbsup:
See ArrayPlot in the help file. All you need is an array:
Event Paint
ArrayPlot MyData(), RgbCol(200, 255, 255, 0), lines=4
ArrayPlot exit, "Exponential function"
Hi JJ
You mean, the result ?
I took a look at the result and compared with the same value as in wolframalpha and...for my surprise, my algo seems to be 100% accurate (At least until the 13th digit) ! :dazzled: :dazzled: :dazzled: :dazzled:
These are the results i´ve got !
When using the input as
Real8 i see this number:
exp(5) = 148.4131591025766
and in wolframalpha, the result is:
exp(5) = 148.4131591025766034211155800405522796234876675938789890467...
exp(-5) = 6.737946999085467e-3
in wolframalpha results in:
exp(-5) = 6.737946999085467096636048423148424248849585027355085430e-3
When i use the input as
Real4(Float), i see this:
exp(5) = 148.4131591025766
exp(-5) = 6.737946999085467e-3
When i use the input as
int, i see this:
exp(5) = 148.4131591025766
exp(-5) = 6.737946999085467e-3
This really surprises me, because on my initial tests, the original version from windows10 had a lack of precision after the 6th or 7th digit, but, somehow i managed to fix this damn algo, regardless the input format. So, even a int or Real4 value will result on a precise value without loss :greensml: :greensml: :greensml: :greensml: :greensml:
deleted
Quote from: nidud on August 13, 2020, 01:29:54 AM
@Nidud: with AsmC /Znk I get that error A2071 :sad:
Quote
Well, it is an erroneous number according to the specification used:
dq 2.42294657314453310e-310
...
However, ML and ML64 version 14 do not produce this error.
Indeed. And Olly reports 2.4229465731445330820e-310 in ST(0) after a
fld with that number.
include \masm32\MasmBasic\MasmBasic.inc
ExpTable dq 2.42294657314453310e-310 ; invalid number for AsmC
Init
fld ExpTable
fld FP10(1.0e310)
fmul
PrintLine "expected: 2.42294657314453310"
PrintLine Str$("obtained: %Jf", ST(0)v)
EndOfCode
Output:
expected: 2.42294657314453310
obtained: 2.422946573144533082
Using 2.4e-320, a certain loss of precision creeps in:
expected: 2.42294657314453310
obtained: 2.422897927205473053
Quote from: nidud on August 13, 2020, 01:29:54 AM
DBL_MAX equ 1.7976931348623158e+308
DBL_MIN equ 2.2250738585072014e-308
That is Normalized range.
Intel specification say that only 80 bit denormalized values can be loaded without to raise an exception... I don't know, here I have an AMD :biggrin:
Yep, but the hardware allows much more: at e-320 you still have over 4 digits of precision, more than enough for most purposes.
deleted
http://www.website.masmforum.com/tutorials/fptute/fpuchap4.htm
QuoteIf the source is a denormalized number, a Denormal exception will be detected, setting the related flag in the Status Word. The value will still be loaded and normalized if possible.
This is exactly what you can observe with Olly when loading from somevar REAL8 2.42294657314453310e-320 - the D flag is set. The question is how to deal with it. IMHO UAsm does it right: accept denormal values. If a programmer insists to work in this exotic range, he should know how to handle the loss of precision. A warning would be ok, though.
:thumbsup:
deleted
Quote from: nidud on August 13, 2020, 09:59:25 AMI'm a bitu sceptical about this approach thought
The hardware can handle the value, but AsmC refuses to accept it? Let's use UAsm then.
Seriously: if the risk of setting the Underflow flag bothers you, issue a warning. The case is so exotic that I wouldn't worry at all.
deleted
Quote from: HSE on August 13, 2020, 05:12:28 AM
Quote from: nidud on August 13, 2020, 01:29:54 AM
DBL_MAX equ 1.7976931348623158e+308
DBL_MIN equ 2.2250738585072014e-308
That is Normalized range.
Intel specification say that only 80 bit denormalized values can be loaded without to raise an exception... I don't know, here I have an AMD :biggrin:
Tks you so much for the equates HSE
Quote from: nidud on August 13, 2020, 11:43:27 AMThe assembler don't use the FPU (or SSE) so it doesn't set the underflow flag but rather the function that convert the string sets errno to ERANGE. I assume the changes in Masm (and Uasm) has to do with external library functions rather than any deliberate changes to the assembler.
errno = 0; /* v2.11: init errno; errno is set on over- and under-flow */
double_value = strtod( inp, NULL );
Of course it's not a runtime check. When you try to assemble
somebyte db 256, the assembler checks the bounds and throws an error. Same for
somedouble REAL8 1.0e-400 - that's an error, and the assembler knows it. But there is a grey area in this case due to the capacity of the FPU to load "denormalised" values beyond 2.22e-308. They behave like legit numbers, so those who need that (a very exotic case, but we stumbled over it in this thread) should be able to use this feature.
Therefore my proposal: issue a
warning for the grey area. Maybe only once, I hate long listings of identical warnings.
At present, you get
fatal error A1012: error count exceeds 100; stopping assembly, and that clearly is not necessary, given that the code runs just fine when assembled with UAasm.
Sheer luxury would be a one-liner:
W1000: 112 denormalised numbers between lines 14 and 39 :cool:
Hi guga,
Very fast real4 SSE2 Exponent routine.
Up to 4 exponents at once.
Precision: 7 digits
.const
Chebylog2E real4 4 dup (1.44269504089)
; 5th Degree Chebyshev ( Remez Algorithm ) Polynomials
Cheby5Exp0 real4 4 dup (0.0018775767)
Cheby5Exp1 real4 4 dup (0.0089893397)
Cheby5Exp2 real4 4 dup (0.055826318)
Cheby5Exp3 real4 4 dup (0.24015361)
Cheby5Exp4 real4 4 dup (0.69315308)
Cheby5Exp5 real4 4 dup (0.99999994)
.code
align 16
SSE2_Exp: ; in: xmm0, out: xmm0
movaps xmm2,oword ptr Chebylog2E
mulps xmm2,xmm0
psrld xmm0,31
cvttps2dq xmm1,xmm2
psubd xmm1,xmm0
movdqa xmm0,xmm1
cvtdq2ps xmm1,xmm1
subps xmm2,xmm1
movaps xmm1,oword ptr Cheby5Exp0
pslld xmm0,23
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby5Exp1
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby5Exp2
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby5Exp3
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby5Exp4
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby5Exp5
paddd xmm0,xmm1
ret
deleted
SSE4_1 version, faster than the SSE2 version.
align 16
SSE41_Exp: ; in: xmm0, out: xmm0
mulps xmm0,oword ptr Chebylog2E
roundps xmm1,xmm0,1
subps xmm0,xmm1
cvtps2dq xmm2,xmm1
pslld xmm2,23
movaps xmm1,xmm0
mulps xmm0,oword ptr Cheby5Exp0
addps xmm0,oword ptr Cheby5Exp1
mulps xmm0,xmm1
addps xmm0,oword ptr Cheby5Exp2
mulps xmm0,xmm1
addps xmm0,oword ptr Cheby5Exp3
mulps xmm0,xmm1
addps xmm0,oword ptr Cheby5Exp4
mulps xmm0,xmm1
addps xmm0,oword ptr Cheby5Exp5
paddd xmm0,xmm2
ret
@nidud: It seems you just want to argue. Frustrated? Can we help you?
include \masm32\MasmBasic\MasmBasic.inc
NotSoSmall REAL8 1.234567890123456e-308 ; legal for all assemblers, even AsmC
VerySmall REAL8 1.234567890123456e-322 ; legal for UAsm and MASM 14+
; TooSmall REAL8 1.234567890123456e-324 ; illegal even for UAsm
Result REAL8 ?
Init
fld NotSoSmall
fmul FP8(1.0e154)
fmul FP8(1.0e154)
fstp Result
Print Str$("NotSoSmall*1e308=\t%Jf\n", Result)
fld VerySmall
fmul FP8(1.0e161)
fmul FP8(1.0e161)
fstp Result
Inkey Str$("VerySmall*1e322=\t%Jf\n", Result)
EndOfCode
Output:
NotSoSmall*1e308= 1.234567890123456246
VerySmall*1e322= 1.235164114603116481
Quote from: Siekmanski on August 14, 2020, 01:03:38 AM
Hi guga,
Very fast real4 SSE2 Exponent routine.
Up to 4 exponents at once.
Precision: 7 digits
.const
Chebylog2E real4 4 dup (1.44269504089)
; 5th Degree Chebyshev ( Remez Algorithm ) Polynomials
Cheby5Exp0 real4 4 dup (0.0018775767)
Cheby5Exp1 real4 4 dup (0.0089893397)
Cheby5Exp2 real4 4 dup (0.055826318)
Cheby5Exp3 real4 4 dup (0.24015361)
Cheby5Exp4 real4 4 dup (0.69315308)
Cheby5Exp5 real4 4 dup (0.99999994)
.code
align 16
SSE2_Exp: ; in: xmm0, out: xmm0
movaps xmm2,oword ptr Chebylog2E
mulps xmm2,xmm0
psrld xmm0,31
cvttps2dq xmm1,xmm2
psubd xmm1,xmm0
movdqa xmm0,xmm1
cvtdq2ps xmm1,xmm1
subps xmm2,xmm1
movaps xmm1,oword ptr Cheby5Exp0
pslld xmm0,23
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby5Exp1
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby5Exp2
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby5Exp3
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby5Exp4
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby5Exp5
paddd xmm0,xmm1
ret
Great ! A couple of questions:
1 - How did you got those numbers of ChebyXXX ? How to calculate this cheby values ?
2 - Can we improve the precision to the 13Th digit ? What is the cost in terms of speed ?
3 - what if we extend it to a 10th Degree Chebyshev ( Remez Algorithm ) Polynomials ? Is it possible to grant more accuracy ?
Hi guga,
I had the calculated polynomials already on my computer for a few years.
You can create them with the program called LolRemez.
It's open source.
https://github.com/samhocevar/lolremez
Maybe somebody here on the forum would be so kind to compile the program for us in Visual Studio? :eusa_pray:
https://github.com/samhocevar/lolremez/releases/download/v0.6/lolremez-0.6.zip
I guess i did it :thumbsup:
See if it is working as expected, please. I had to install github on the desktop to make this thing compile :mrgreen: :mrgreen: :mrgreen:
:thumbsup:
lolremez -d 10 -r 0:1 "2**x" "2**x"
// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
// p(x)=(((((((((9.9522144894077596e-9*x+9.4609455637620191e-8)*x+1.331271998
894e-6)*x+1.5244659656760603e-5)*x+1.5403957490414841e-4)*x+1.333354373097474
3)*x+9.6181294091749148e-3)*x+5.5504108628244985e-2)*x+2.402265069613608e-1)*
.9314718055989127e-1)*x+1.0000000000000002
double f(double x)
{
double u = 9.9522144894077596e-9;
u = u * x + 9.4609455637620191e-8;
u = u * x + 1.331271998884894e-6;
u = u * x + 1.5244659656760603e-5;
u = u * x + 1.5403957490414841e-4;
u = u * x + 1.3333543730974749e-3;
u = u * x + 9.6181294091749148e-3;
u = u * x + 5.5504108628244985e-2;
u = u * x + 2.402265069613608e-1;
u = u * x + 6.9314718055989127e-1;
return u * x + 1.0000000000000002;
}
deleted
Quote from: Siekmanski on August 14, 2020, 06:16:40 AM
:thumbsup:
lolremez -d 10 -r 0:1 "2**x" "2**x"
// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
// p(x)=(((((((((9.9522144894077596e-9*x+9.4609455637620191e-8)*x+1.331271998
894e-6)*x+1.5244659656760603e-5)*x+1.5403957490414841e-4)*x+1.333354373097474
3)*x+9.6181294091749148e-3)*x+5.5504108628244985e-2)*x+2.402265069613608e-1)*
.9314718055989127e-1)*x+1.0000000000000002
double f(double x)
{
double u = 9.9522144894077596e-9;
u = u * x + 9.4609455637620191e-8;
u = u * x + 1.331271998884894e-6;
u = u * x + 1.5244659656760603e-5;
u = u * x + 1.5403957490414841e-4;
u = u * x + 1.3333543730974749e-3;
u = u * x + 9.6181294091749148e-3;
u = u * x + 5.5504108628244985e-2;
u = u * x + 2.402265069613608e-1;
u = u * x + 6.9314718055989127e-1;
return u * x + 1.0000000000000002;
}
Hi Marinus
Thanks... :thumbsup: :thumbsup: :thumbsup:
This is what i would ask yo8u When do you know which range to use ?
I mean, on https://github.com/samhocevar/lolremez/wiki/Tutorial-1-of-5%3A-exp%28x%29-the-quick-way It shows a example of a exp(x) function with a range of -1 to 1. Yours is a range of 0 to 1. So, i don´t get it. We need to use the value of x always normalized ???
Because even when we insert a range of 0 to 1 with exp(x), the values are different from the ones you posted.
Btw... why did you inputed "2**x" "2**x" to calculate exp(x) ?
2**x is just the notation for the Exponent function, the second one is the weight function.
Just as you, I have to find my way in this LolRemez program. :thumbsup:
The example -1 to 1 is often used in audio.
You stay always in that range, so you don't have compensation code for numbers outside that range.
I just tried to get the polynomials I used in my SSE2 example.
They are almost the same.
I will use the new ones. ( should be more precise since LolRemez has evolved to version 0.6 :cool:)
lolremez -d 5 -r 0:1 "2**x" "2**x"
// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 5.
// p(x)=((((1.8775766751914795e-3*x+8.9893400904946651e-3)*x+5.5826318053295672e
-2)*x+2.401536170443754e-1)*x+6.9315307320016896e-1)*x+9.9999992506352618e-1
double f(double x)
{
double u = 1.8775766751914795e-3;
u = u * x + 8.9893400904946651e-3;
u = u * x + 5.5826318053295672e-2;
u = u * x + 2.401536170443754e-1;
u = u * x + 6.9315307320016896e-1;
return u * x + 9.9999992506352618e-1;
fld FP4(0.5) ; x
fld st
fld st
fld st
fld st
fmul FP8(1.8775766751914795e-3)
fadd FP8(8.9893400904946651e-3)
fmul
fadd FP8(5.5826318053295672e-2)
fmul
fadd FP8(2.401536170443754e-1)
fmul
fadd FP8(6.9315307320016896e-1)
fmul
fadd FP8(9.9999992506352618e-1) ; result on FPU
PrintLine Str$("2^0.5=\t %Jf", ST(0)v), CrLf$, "expected: 1.414213562373095048"
Output:
2^0.5= 1.414213663708122130
expected: 1.414213562373095048
Guga,
Thank you for compiling this piece of GOLD for me in Visual Studio.
LolRemez is really a very valuable program.
It opens doors to get very fast trigonometric functions with polynomials exactly the way I want them.
Hi Jochen,
This is cool stuff or not?
Now we can adjust the precision to real4 or real8 just by finding the correct degree of polynomials.
Jochen,
You calculated Exp2
Exp2(0.5) is equal to sqrt(2.0)
It calculates the value of 2 raised to the power of x
Here is the FPU function for Exp
It calculates the value of e (the base of natural logarithms) raised to the power of x
FPU_Exp:
; st(0) == x
fldl2e
fmulp st(1), st(0)
fld st(0)
frndint
fsub st(1), st(0)
fxch
f2xm1
fld1
faddp
fscale
fstp st(1)
ret
result: 1.648721270700 ( for 0.5 )
Quote from: Siekmanski on August 14, 2020, 08:32:19 AM
Hi Jochen,
This is cool stuff or not?
Yessss :thumbsup:
Quote from: Siekmanski on August 14, 2020, 08:48:42 AM
You calculated Exp2
Exp2(0.5) is equal to sqrt(2.0)
Indeed. Normally I use
Print Str$("2^0.5=%Jf\n", Exp2 (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1193)(0.5)v). Output:
2^0.5=1.414213562373095049 :smiley:
Quote from: Siekmanski on August 14, 2020, 07:08:51 AM
2**x is just the notation for the Exponent function, the second one is the weight function.
Just as you, I have to find my way in this LolRemez program. :thumbsup:
The example -1 to 1 is often used in audio.
You stay always in that range, so you don't have compensation code for numbers outside that range.
I just tried to get the polynomials I used in my SSE2 example.
They are almost the same.
I will use the new ones. ( should be more precise since LolRemez has evolved to version 0.6 :cool:)
lolremez -d 5 -r 0:1 "2**x" "2**x"
// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 5.
// p(x)=((((1.8775766751914795e-3*x+8.9893400904946651e-3)*x+5.5826318053295672e
-2)*x+2.401536170443754e-1)*x+6.9315307320016896e-1)*x+9.9999992506352618e-1
double f(double x)
{
double u = 1.8775766751914795e-3;
u = u * x + 8.9893400904946651e-3;
u = u * x + 5.5826318053295672e-2;
u = u * x + 2.401536170443754e-1;
u = u * x + 6.9315307320016896e-1;
return u * x + 9.9999992506352618e-1;
Hi Marinus. You´re welcome :thumbsup: :thumbsup: :thumbsup:
Btw,, i still don´t get this app to work properly.
I tried the 10th term and did this:
lolremez -d 10 -r 0:1 "2**x" "2**x"
It created this polynomials
// p(x)=(((((((((9.9522144894077596e-9*x+9.4609455637620191e-8)*x+1.331271998884894e-6)*x+1.5244659656760603e-5)*x+1.5403957490414841e-4)*x+1.3333543730974749e-3)*x+9.6181294091749148e-3)*x+5.5504108628244985e-2)*x+2.402265069613608e-1)*x+6.9314718055989127e-1)*x+1.0000000000000002
double f(double x)
{
double u = 9.9522144894077596e-9;
u = u * x + 9.4609455637620191e-8;
u = u * x + 1.331271998884894e-6;
u = u * x + 1.5244659656760603e-5;
u = u * x + 1.5403957490414841e-4;
u = u * x + 1.3333543730974749e-3;
u = u * x + 9.6181294091749148e-3;
u = u * x + 5.5504108628244985e-2;
u = u * x + 2.402265069613608e-1;
u = u * x + 6.9314718055989127e-1;
return u * x + 1.0000000000000002;
}
But, when i tried to put them onto a function, it failed miserably. I don´t understand what i´m doing wrong. All i did was:
[<16 Cheby10Exp0: R$ 9.9522144894077596e-9, R$ 9.9522144894077596e-9]
[<16 Cheby10Exp1: R$ 9.4609455637620191e-8, R$ 9.4609455637620191e-8]
[<16 Cheby10Exp2: R$ 1.331271998884894e-6, R$ 1.331271998884894e-6]
[<16 Cheby10Exp3: R$ 1.5244659656760603e-5, R$ 1.5244659656760603e-5]
[<16 Cheby10Exp4: R$ 1.5403957490414841e-4, R$ 1.5403957490414841e-4]
[<16 Cheby10Exp5: R$ 1.3333543730974749e-3, R$ 1.3333543730974749e-3]
[<16 Cheby10Exp6: R$ 9.6181294091749148e-3, R$ 9.6181294091749148e-3]
[<16 Cheby10Exp7: R$ 5.5504108628244985e-2, R$ 5.5504108628244985e-2]
[<16 Cheby10Exp8: R$ 2.402265069613608e-1, R$ 2.402265069613608e-1]
[<16 Cheby10Exp9: R$ 6.9314718055989127e-1, R$ 6.9314718055989127e-1]
[<16 Cheby10Exp10: R$ 1.0000000000000002, R$ 1.0000000000000002]
Proc SSE2_Exp2::
Arguments @Number, @Flag
mov eax D@Flag
Test_if eax SSE_EXP_INT
cvtsi2ss xmm1 D@Number ; converts a signed integer to double
Test_Else_if eax SSE_EXP_FLOAT
;movss xmm1 D@Number ; converts a single precision float to double
cvtss2sd xmm1 X@Number ; converts a single precision float to double
Test_Else_if eax SSE_EXP_REAL8
mov eax D@Number | movss XMM1 X$eax
Test_Else_if eax SSE_EXP_QWORD
mov eax D@Number | movq XMM1 X$eax
Test_Else
xor eax eax | ExitP ; return 0 Invalid parameter
Test_End
movupd xmm0 X$Cheby10Exp0
mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp1
mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp2
mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp3
mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp4
mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp5
mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp6
mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp7
mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp8
mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp9
mulpd xmm0 xmm1 | addpd xmm0 X$Cheby10Exp10
EndP
It was supposed to work, because all the values and way to calculate as equal to the C version, right ?
I tried calculating exp of 5, but instead resulting in:
148.41315910257660342111558004055227962348766759387898904675284511
It returned:
31.987334657819531
:biggrin:
you created an Exp2 function.
Exp function calculates the value of e (the base of natural logarithms) raised to the power of x.
See the version I posted before.
I have to go to bed now, tomorrow I'll show you how to implement the 10th degree polynomials.
Thanks to dodicat (From Freebasic forum), we also have a a FreeBasic conversion of this LOlRemex app.
Can someone, please port it to asm ?
"Compile the file poly.bas to .exe
Click it to get the command prompt.
It runs thus:
At the prompt something like:
poly "-6,8,cos(x)*sin(x)"
Remember to use the double quotes as above!
meaning from x = -6 to 8 the function cos(x)*sin(x) is approximated.
Press the spacebar to build up the approximation until you are happy that the curves are similar.
The standard way of showing the approximation coefficients is shown on the console, but you have the option (upon exit by pressing <esc>), to save to a .bas file, the nested method, which is much faster arithmetically for the computer."
Reference: https://www.freebasic.net/forum/viewtopic.php?t=25897
With your suggested input lolremez.exe poly "-6,8,cos(x)*sin(x)" it says input error.
With lolremez.exe "poly -6,8,cos(x)*sin(x)" it says EXPECTED (, got '-' but proceeds afterwards with nice graphs.
Quote from: jj2007 on August 14, 2020, 06:58:05 PM
With your suggested input lolremez.exe poly "-6,8,cos(x)*sin(x)" it says input error.
With lolremez.exe "poly -6,8,cos(x)*sin(x)" it says EXPECTED (, got '-' but proceeds afterwards with nice graphs.
Tks a lot JJ. I saw your post on the original forum and gave another try and it compiled ok :azn: :azn: :azn: I saw the graphics and it is really nice.
The command works without this "poly" token
I used as:
LolRemez "-3,3,x^3-x^2"
pause
(https://i.ibb.co/yqDbCGQ/Image1.png) (https://ibb.co/Zx3krtM)
It also exports a function to be compiled in freebasic containing the polynomials :)
I tried also with
LolRemez "0,1,exp(x)"
And it worked, but exported somethign i failed to understand. It exported this:
(((((((((((((((((((((((((((((((+0.04829789212904446)*x-0.1792981145327924)*x-0.6677612347982678)*x+5.893146189666752)*x _
-17.1355277003133)*x _
+26.0763728821472)*x _
-19.69798344538533)*x _
+0.9900030668437522)*x _
+10.30400147926733)*x _
-8.096178275550896)*x _
+11.14649562485596)*x _
-31.20839952020678)*x _
+51.6443661394669)*x _
-51.27397166179219)*x _
+31.83357261868321)*x _
-10.78621348361321)*x _
-0.4973412406197402)*x _
+2.927004122935927)*x _
-1.894224695114594)*x _
+0.741826698340469)*x _
-0.2034182656840364)*x _
+0.04056360695256523)*x _
-0.005880298181518774)*x _
+0.0008169520352547144)*x _
+0.001343698483337935)*x _
+0.008335523163337043)*x _
+0.04166660157481193)*x _
+0.1666666677131286)*x _
+0.4999999999927096)*x _
+1.000000000000015)*x _
+1) _
All of these polynomials are to be used ?
Quote from: guga on August 14, 2020, 09:22:32 PMThe command works without this "poly" token
I'm not sure about that - the output looks a lot different.
I think that on each key you press, it is equivalent to one iteration on the original lolremez. I gave a try on the freebasic version with
LolRemez "0,1,exp(x)"
and hit 11 times the key until it produces a error of something around e-11. It produces these values:
(((((((((((+4.567028385771292e-007)*x+2.288390713140263e-006)*x+2.545465440196621e-005)*x _
+0.0001978554379130849)*x _
+0.001389191775841137)*x _
+0.008333228132164795)*x _
+0.04166668937151426)*x _
+0.1666666638148875)*x _
+0.5000000001833158)*x _
+0.9999999999954152)*x _
+1.000000000000019) _
I´ll give a try on them to see if it works, but i´m clueless how to use them yet. Better would be wait until Marinus wake up and explain to us how to use this lolremez app.
The graphics are very handy, btw, specially because we can actually see how much of the error we want to get rid off.
Ok, i gave a try on the generated values from dodicat FreeBasic version. On 10 iterations, it produced the values i used below. But the result completely lacks accuracy
[<16 Cheby10Exp0: R$ 4.567028385771292e-007]
[<16 Cheby10Exp1: R$ 2.288390713140263e-006]
[<16 Cheby10Exp2: R$ 2.545465440196621e-005]
[<16 Cheby10Exp3: R$ 0.0001978554379130849]
[<16 Cheby10Exp4: R$ 0.001389191775841137]
[<16 Cheby10Exp5: R$ 0.008333228132164795]
[<16 Cheby10Exp6: R$ 0.04166668937151426]
[<16 Cheby10Exp7: R$ 0.1666666638148875]
[<16 Cheby10Exp8: R$ 0.5000000001833158]
[<16 Cheby10Exp9: R$ 0.9999999999954152]
[<16 Cheby10Exp10: R$ 1.000000000000019]
Proc SSE2_Exp2::
Arguments @Number, @Flag
mov eax D@Flag
Test_if eax SSE_EXP_INT
cvtsi2ss xmm1 D@Number ; converts a signed integer to double
Test_Else_if eax SSE_EXP_FLOAT
;movss xmm1 D@Number ; converts a single precision float to double
cvtss2sd xmm1 X@Number ; converts a single precision float to double
Test_Else_if eax SSE_EXP_REAL8
mov eax D@Number | movss XMM1 X$eax
Test_Else_if eax SSE_EXP_QWORD
mov eax D@Number | movq XMM1 X$eax
Test_Else
xor eax eax | ExitP ; return 0 Invalid parameter
Test_End
movsd xmm0 X$Cheby10Exp0
mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp1
mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp2
mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp3
mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp4
mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp5
mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp6
mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp7
mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp8
mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp9
mulsd xmm0 xmm1 | addsd xmm0 X$Cheby10Exp10
EndP
I gave a try calculating the exp of 5. The correct value of exp^5 should be:
148.41315910257660342111558004055227962348766759387898904675284511
But this lolremez app returned to me:
147.452655481459771
(Similar result as the original lolremez, btw). So, better wait Marinus wake up and help on this to explain why we are having so huge margin of error.
you may also benefit from visiting the lolremez-wiki https://github.com/samhocevar/lolremez/wiki/Tutorial-1-of-5%3A-exp%28x%29-the-quick-way
the git repo is at https://github.com/samhocevar/lolremez you can build lolremez on windows using VS2019 and of course on linux
Hi Jack,
Guga already compiled it for us.
Hi Guga,
Exp(x) = Exp2(x / Logn(2.0))
1/Logn(2.0) = 1.4426950408889634073599246810019
Now we can use ( x * 1.4426950408889634073599246810019 ) to calculate the Exp.
E = x * 1.442695
To use the polynomials we have to use only the fractional part of E [0 to 1]
When done, scale the result by adding the integer part to the exponent field of the floating point result.
Here are the 10 degree polynomials and the SSE2 routine.
lolremez -d 10 -r 0:1 "2**x" "2**x"
// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
double f(double x)
{
double u = 9.9522144894077596e-9;
u = u * x + 9.4609455637620191e-8;
u = u * x + 1.331271998884894e-6;
u = u * x + 1.5244659656760603e-5;
u = u * x + 1.5403957490414841e-4;
u = u * x + 1.3333543730974749e-3;
u = u * x + 9.6181294091749148e-3;
u = u * x + 5.5504108628244985e-2;
u = u * x + 2.402265069613608e-1;
u = u * x + 6.9314718055989127e-1;
return u * x + 1.0000000000000002;
}
.const
Chebylog2E real4 4 dup (1.44269504089) ; 1/Logn(2.0)
Cheby10Exp0 real4 4 dup (9.9522144894077596e-9)
Cheby10Exp1 real4 4 dup (9.4609455637620191e-8)
Cheby10Exp2 real4 4 dup (1.331271998884894e-6)
Cheby10Exp3 real4 4 dup (1.5244659656760603e-5)
Cheby10Exp4 real4 4 dup (1.5403957490414841e-4)
Cheby10Exp5 real4 4 dup (1.3333543730974749e-3)
Cheby10Exp6 real4 4 dup (9.6181294091749148e-3)
Cheby10Exp7 real4 4 dup (5.5504108628244985e-2)
Cheby10Exp8 real4 4 dup (2.402265069613608e-1)
Cheby10Exp9 real4 4 dup (6.9314718055989127e-1)
Cheby10Exp10 real4 4 dup (1.0000000000000002)
.code
align 16
SSE2_Exp_10: ; in: xmm0, out: xmm1
movaps xmm2,oword ptr Chebylog2E
mulps xmm2,xmm0
psrld xmm0,31
cvttps2dq xmm1,xmm2
psubd xmm1,xmm0
movdqa xmm0,xmm1
cvtdq2ps xmm1,xmm1
subps xmm2,xmm1
movaps xmm1,oword ptr Cheby10Exp0
pslld xmm0,23
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp1
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp2
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp3
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp4
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp5
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp6
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp7
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp8
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp9
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp10
paddd xmm0,xmm1
ret
Here is a real8 version, where I explain the above exp math in code.
align 16
SSE41_Exp_10: ; in: xmm0, out: xmm0
mulpd xmm0,oword ptr NaturalLogE
roundpd xmm1,xmm0,1 ; lose the fractional parts and save the integer parts in xmm1
subpd xmm0,xmm1 ; Subtract them, now it's inside the range, results are values between 0.0 to 1.0
cvtpd2dq xmm2,xmm1 ; save the integer parts in xmm2
psllq xmm2,52 ; shift quadwords 52 bits left ( size of real8 mantissa ) to the integer parts of the exponent parts
movapd xmm1,xmm0 ; copy the range
mulpd xmm0,oword ptr Exp_Cheby10_0_pd
addpd xmm0,oword ptr Exp_Cheby10_1_pd
mulpd xmm0,xmm1
addpd xmm0,oword ptr Exp_Cheby10_2_pd
mulpd xmm0,xmm1
addpd xmm0,oword ptr Exp_Cheby10_3_pd
mulpd xmm0,xmm1
addpd xmm0,oword ptr Exp_Cheby10_4_pd
mulpd xmm0,xmm1
addpd xmm0,oword ptr Exp_Cheby10_5_pd
mulpd xmm0,xmm1
addpd xmm0,oword ptr Exp_Cheby10_6_pd
mulpd xmm0,xmm1
addpd xmm0,oword ptr Exp_Cheby10_7_pd
mulpd xmm0,xmm1
addpd xmm0,oword ptr Exp_Cheby10_8_pd
mulpd xmm0,xmm1
addpd xmm0,oword ptr Exp_Cheby10_9_pd
mulpd xmm0,xmm1
addpd xmm0,oword ptr Exp_Cheby10_10_pd
paddq xmm0,xmm2 ; add the exponent parts to the floating point result
ret
Quote from: Siekmanski on August 15, 2020, 04:53:26 AM
Hi Jack,
Guga already compiled it for us.
Hi Guga,
Exp(x) = Exp2(x / Logn(2.0))
1/Logn(2.0) = 1.4426950408889634073599246810019
Now we can use ( x * 1.4426950408889634073599246810019 ) to calculate the Exp.
E = x * 1.442695
To use the polynomials we have to use only the fractional part of E [0 to 1]
When done, scale the result by adding the integer part to the exponent field of the floating point result.
Here are the 10 degree polynomials and the SSE2 routine.
lolremez -d 10 -r 0:1 "2**x" "2**x"
// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
double f(double x)
{
double u = 9.9522144894077596e-9;
u = u * x + 9.4609455637620191e-8;
u = u * x + 1.331271998884894e-6;
u = u * x + 1.5244659656760603e-5;
u = u * x + 1.5403957490414841e-4;
u = u * x + 1.3333543730974749e-3;
u = u * x + 9.6181294091749148e-3;
u = u * x + 5.5504108628244985e-2;
u = u * x + 2.402265069613608e-1;
u = u * x + 6.9314718055989127e-1;
return u * x + 1.0000000000000002;
}
.const
Chebylog2E real4 4 dup (1.44269504089) ; 1/Logn(2.0)
Cheby10Exp0 real4 4 dup (9.9522144894077596e-9)
Cheby10Exp1 real4 4 dup (9.4609455637620191e-8)
Cheby10Exp2 real4 4 dup (1.331271998884894e-6)
Cheby10Exp3 real4 4 dup (1.5244659656760603e-5)
Cheby10Exp4 real4 4 dup (1.5403957490414841e-4)
Cheby10Exp5 real4 4 dup (1.3333543730974749e-3)
Cheby10Exp6 real4 4 dup (9.6181294091749148e-3)
Cheby10Exp7 real4 4 dup (5.5504108628244985e-2)
Cheby10Exp8 real4 4 dup (2.402265069613608e-1)
Cheby10Exp9 real4 4 dup (6.9314718055989127e-1)
Cheby10Exp10 real4 4 dup (1.0000000000000002)
.code
align 16
SSE2_Exp_10: ; in: xmm0, out: xmm1
movaps xmm2,oword ptr Chebylog2E
mulps xmm2,xmm0
psrld xmm0,31
cvttps2dq xmm1,xmm2
psubd xmm1,xmm0
movdqa xmm0,xmm1
cvtdq2ps xmm1,xmm1
subps xmm2,xmm1
movaps xmm1,oword ptr Cheby10Exp0
pslld xmm0,23
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp1
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp2
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp3
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp4
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp5
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp6
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp7
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp8
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp9
mulps xmm1,xmm2
addps xmm1,oword ptr Cheby10Exp10
paddd xmm0,xmm1
ret
Now i see where you got those values :) Thanks a lot.
One question....Why using the range only to 0 to 1 ? On github, the example is from -1 to 1 . When should we use them ?
Does 0 to 1 means only positive exponential ? Ex: exp^5
and
-1 to 1 means negative ? Ex: exp^-5 ??
Btw...did you checked the margin of error ?
I´ll give a try on the method you use and the one that uses a double check for error, like this:
lolremez -d 10 -r -1:1 "exp(x)" "exp(x)"
What´s the difference from yours ? I mean why using "2**x" "2**x" ?
When i use "exp(x)" the values are different
Other question, why using...
movaps xmm2,oword ptr Chebylog2E
mulps xmm2,xmm0
psrld xmm0,31 ; Why ?
cvttps2dq xmm1,xmm2
psubd xmm1,xmm0
movdqa xmm0,xmm1
cvtdq2ps xmm1,xmm1
subps xmm2,xmm1
movaps xmm1,oword ptr Cheby10Exp0
pslld xmm0,23
(...)
wouldn´t only be necessary multiply this new value ?
Why not using only ?
mulsd xmm0 X$Chebylog2E and later add and multiply by the remaining values ?
tested for accuracy :)
It looses accuracy after the 4th digit after the "."
correct value:
exp^5 = 148.41315910257660342111558004055227962348766759387898904675284511
using lolremez
exp^5 = 148.413162231445312
This is similar to what i found yesterday. On the version i made it has 0 loss and the speed was not so different from the ones using lolremez. Wouldn´t be better if we try to optimize the version i ported ? I´m pretty sure there´s room for optimization yet.
It´s juust a matter of trying to understand what are the values found on the Exptable.
For what i saw, those values (The bigger ones) seems to be exponents of integers divided by 5000. While the others (around e-310 seems to be some shifted values of some sort.)
Ex: 2.04582073601169570e-16 = exp(-180628/5000)
2.42294657314453310e-310 = exp(-712.9163944364413) = exp(some shift or division by what ????)
4.99974487227263260e-17 = exp(-187673/5000)
And the "SSE_Shifter4" seems to be formed by 2 exponentials of log(2). Like this:
; (x) = exp(51 log(2) + log(3)) = 6.755399441055744e+15
; (y) = exp(-60 log(2)) = 8.67361737988403547e-19
[SSE_Shifter4: R$ 6.755399441055744e+15, R$ 6.755399441055744e+15,
R$ 8.67361737988403547e-19, R$ 8.67361737988403547e-19]
Remember that real4 has only 7.2 digits precision so, you need real8 to get more digits ( I believe 15 for real8 )
QuoteOne question....Why using the range only to 0 to 1 ? On github, the example is from -1 to 1 . When should we use them ?
-> For example if you only need to do calculations in the range -1 to 1 as in e.g. Audio routines.
It's saves you the correction code, splitting it up in fractions and integers and glue everything together into a float.
QuoteDoes 0 to 1 means only positive exponential ? Ex: exp^5
and
-1 to 1 means negative ? Ex: exp^-5 ??
-> No, the signs are taken care of.
QuoteBtw...did you checked the margin of error ?
-> No
QuoteI´ll give a try on the method you use and the one that uses a double check for error, like this:
lolremez -d 10 -r -1:1 "exp(x)" "exp(x)"
What´s the difference from yours ? I mean why using "2**x" "2**x" ?
When i use "exp(x)" the values are different
-> I use the Exp2 polynomials to calculate Exp()
2**x = 2 raised to the power of x = Exp2()
Why? because, Exp() calculates the value of e (the base of natural logarithms) raised to the power of x.
-> By simplifying the math, we have 1 set of polynomials to calculate 2 functions, Exp() and Exp2() :cool:
Exp(x) = Exp2(x / Logn(2.0))
Create a reciprocal: 1/Logn(2.0) = 1.442695
Now we can use ( x * 1.442695 ) to calculate Exp().
QuoteOther question, why using...
movaps xmm2,oword ptr Chebylog2E
mulps xmm2,xmm0
psrld xmm0,31 ; Why ?
cvttps2dq xmm1,xmm2
psubd xmm1,xmm0
movdqa xmm0,xmm1
cvtdq2ps xmm1,xmm1
subps xmm2,xmm1
movaps xmm1,oword ptr Cheby10Exp0
pslld xmm0,23
-> In SSE2 we don't have a floor instruction ( SSE4_1 does: roundps xmm1,xmm0,1 )
We need that to work outside the range of the polynomials.
So we have to create it ourselfs:
movaps xmm2,oword ptr Chebylog2E
mulps xmm2,xmm0
psrld xmm0,31 ; put sign-bits in position
cvttps2dq xmm1,xmm2 ; convert with truncation, packed real4 to packed signed 32bit integers
psubd xmm1,xmm0 ; signed integers - sign bits
movdqa xmm0,xmm1 ; save them
cvtdq2ps xmm1,xmm1 ; convert 32bit packed integers to packed real4
subps xmm2,xmm1 ; get the fractional parts ( range 0.0 to 1.0 )
movaps xmm1,oword ptr Cheby10Exp0
pslld xmm0,23 ; shift 32bit integers 23 bits to the left, ( size of real4 mantissa ) into the exponent fields
at the end of the routine:
paddd xmm0,xmm1 ; add the exponent fields to the real4 results
Quotewouldn´t only be necessary multiply this new value ?
Why not using only ?
mulsd xmm0 X$Chebylog2E and later add and multiply by the remaining values ?
Logged
-> Because it doesn't work that way, we need a range to work with the polynomials.
edit:
Improved version:
.data
E_1 REAL4 1.0000
E_2 REAL4 0.5000
E_3 REAL4 0.16666
E_4 REAL4 0.04166
E_5 REAL4 0.00833
.code
ASM_Exp_2 proc
movss xmm4, xmm0
movss xmm1, xmm0
addss xmm1, E_1
mulss xmm4, xmm0
movss xmm2, E_2
vfmadd231ss xmm1, xmm2, xmm4
mulss xmm4, xmm0
movss xmm3, E_3
vfmadd231ss xmm1, xmm3, xmm4
mulss xmm4, xmm0
movss xmm2, E_4
vfmadd231ss xmm1, xmm2, xmm4
mulss xmm4, xmm0
movss xmm3, E_5
vfmadd231ss xmm1, xmm3, xmm4
movss xmm0, xmm1
ret
ASM_Exp_2 endp
end
Here's my rough & ready effort.
.data
N_1 REAL4 1.000000000000000
N_2 REAL4 2.000000000000000
N_3 REAL4 6.000000000000000
N_4 REAL4 24.000000000000000
N_5 REAL4 120.000000000000000
.code
ASM_Exp proc
movss xmm1, N_1
movss xmm2, xmm0
movss xmm3, xmm1
movss xmm5, xmm1
addss xmm1, xmm0 ;x+1
movss xmm4, N_2
divss xmm3, xmm4 ;1/n!
mulss xmm2, xmm0 ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_3
divss xmm3, xmm4 ;1/n!
mulss xmm2, xmm0 ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_4
divss xmm3, xmm4 ;1/n!
mulss xmm2, xmm0 ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_5
divss xmm3, xmm4 ;1/n!
mulss xmm2, xmm0 ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm0, xmm1
ret
ASM_Exp endp
end
Quote from: InfiniteLoop on August 15, 2020, 11:20:48 PM
Here's my rough & ready effort.
.data
N_1 REAL4 1.000000000000000
N_2 REAL4 2.000000000000000
N_3 REAL4 6.000000000000000
N_4 REAL4 24.000000000000000
N_5 REAL4 120.000000000000000
.code
ASM_Exp proc
movss xmm1, N_1
movss xmm2, xmm0
movss xmm3, xmm1
movss xmm5, xmm1
addss xmm1, xmm0 ;x+1
movss xmm4, N_2
divss xmm3, xmm4 ;1/n!
mulss xmm2, xmm0 ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_3
divss xmm3, xmm4 ;1/n!
mulss xmm2, xmm0 ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_4
divss xmm3, xmm4 ;1/n!
mulss xmm2, xmm0 ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm3, xmm5
movss xmm4, N_5
divss xmm3, xmm4 ;1/n!
mulss xmm2, xmm0 ;x^n
vfmadd231ss xmm1, xmm2, xmm3
movss xmm0, xmm1
ret
ASM_Exp endp
end
My computer is to old, can't handle FMA instructions. :sad: