
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
5396 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11737 cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3163 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
5060 cycles for 100 * Log10
3211 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4933 cycles for 100 * Log10
3209 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4823 cycles for 100 * Log10
3186 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4773 cycles for 100 * Log10
3219 cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4739 cycles for 100 * Log10
536 bytes for Sse2_log10_precise (Guga SSE2 Log10 precise )
16 bytes for Log10
0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for Log10
0.6989700043360188 expected
Hi JJ.
I think you used the older version. The new one produces the correct value:
0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
The error is on the last 2 digits only :) Your version has fewer digits than mine. So, on RosAsm, the full number (18 digits) is: 0.698970004336018857, with only the last 2 digits losing precision.
Please, try with FastLog10a.zip I posted a few comments earlier :)
I managed to optimize it a bit further, but I'm trying to work out the proper math to optimize it even more and avoid using so many registers. I now need to optimize this part (the modification below also produces the correct result and gained extra speed):
(...)
; same as SSE_Place_Log0
unpcklpd XMM2 XMM2 | mulpd XMM2 X$SSE_Log1020 | addpd XMM2 XMM3 | addpd XMM4 XMM2
movupd XMM1 X$Log_Coeff0
movupd XMM2 XMM0 | mulpd XMM2 XMM2 | mulsd XMM2 XMM2 | mulsd XMM2 XMM0
mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1
mulpd XMM1 XMM0
addpd XMM1 X$SSE_Log10Var2 | mulpd XMM1 XMM2
movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0
;--------------------
; all necessary data are stored in the Packed Double pairs in registers xmm4, xmm1 and xmm2. Modified original version; with shuffle it is faster.
; We only need to sum all of them
addpd xmm1 xmm4 ; sum all double quads from xmm1 and xmm4. xmm1 = xmm1+xmm4
addpd xmm1 xmm2 ; sum both doubles of the result above with both doubles of xmm2. xmm1 => xmm1+xmm4+xmm2
movupd xmm0 xmm1 ; Now we need to sum both doubles of the resultant register xmm1. To do that we copy xmm1 to xmm0
pshufd XMM0 XMM0 SSE_SWAP_QWORDS ; and swap its pair of doubles. This way xmm0 and xmm1 hold the pairs in inverted order. SSE_SWAP_QWORDS = 78
addpd xmm0 xmm1 ; and now we only need to sum them up. Both doubles of xmm0 hold the full xmm1+xmm4+xmm2 total
The math for this is the formula below ("Lo" and "Hi" are the low and high halves of the double quadword in each register or variable):
Hiquadxmm0 = xmm0.Lo^2 * ((Log_Coeff0.Lo*xmm0.Lo + SSE_Log10Var1.Lo) * xmm0.Lo + SSE_Log10Var2.Lo) + xmm4.Lo + (SSE_Log10Var3.Lo*xmm0.Lo)
Loquadxmm0 = xmm0.Hi^5 * ((Log_Coeff0.Hi*xmm0.Hi + SSE_Log10Var1.Hi) * xmm0.Hi + SSE_Log10Var2.Hi) + xmm4.Hi + (SSE_Log10Var3.Hi*xmm0.Hi)
Final xmm0 = Hiquadxmm0 + Loquadxmm0
I'm trying to recreate the above equations using fewer registers (forcing it to use only xmm0, xmm1 and xmm2) while keeping the speed, with shorter computations, but it is hard to find the proper way.
I tried the version below, but the result is incorrect. I'm missing something.
;---------------------------- This is good for speed, but I'm missing some calculation, because it now produces an incorrect value.
; 1st step xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0 | addpd xmm2 xmm4
; ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
movupd XMM1 X$Log_Coeff0 | mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var2
; since xmm0.Hi and xmm0.Lo are part of the same xmm0 we can simply multiply by it to get
; the Lo and Hi quads of xmm0. For the Lo quad we must then simply mul 3 more times to get xmm0.Hi^5
mulpd XMM1 XMM0 | mulpd XMM1 XMM0
; ok, now we get the Hi Part xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
; To get the Lo part we do:
mulsd xmm1 xmm0 | mulsd xmm1 xmm0 | mulsd xmm1 xmm0
; Now we can finally add xmm2 to xmm1
addpd xmm1 xmm2
; And finally we Exchange the data to add both hi and lo quads
movupd xmm0 xmm1 ; Now we need to sum both doubles of the resultant register xmm1. To do that we copy xmm1 to xmm0
pshufd XMM0 XMM0 SSE_SWAP_QWORDS ; and swap its pair of doubles. This way xmm0 and xmm1 hold the pairs in inverted order
addpd xmm0 xmm1 ; and now we only need to sum them up. Both doubles of xmm0 hold the final result