Fast Log10 approximation

guga · August 17, 2020, 05:20:27 PM

Quote from: jj2007 on August 17, 2020, 10:20:09 AM
Quote from: HSE on August 16, 2020, 07:47:28 AM

Code:
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

5396    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11737   cycles for 100 * SmplMath: fSlv MyReal8 = log(MyExpo)

Code:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3163    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
5060    cycles for 100 * Log10

3211    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4933    cycles for 100 * Log10

3209    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4823    cycles for 100 * Log10

3186    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4773    cycles for 100 * Log10

3219    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4739    cycles for 100 * Log10

536     bytes for Sse2_log10_precise (Guga SSE2 Log10 precise )
16      bytes for Log10

0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for Log10
0.6989700043360188 expected

Hi JJ.
I think you used the older version. The new one produces the correct value:

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)

The error is in the last 2 digits only :) Your version shows fewer digits than mine. In RosAsm, the full number (18 digits) is 0.698970004336018857, and only the last 2 digits lose precision.

Please try FastLog10a.zip, which I posted a few comments earlier :)

I managed to optimize a bit further, but I'm trying to do the proper math to optimize it even more and avoid using so many registers. I now need to optimize this part (the modification below also produces the correct result and gained extra speed):

Code:

(...)
     ; same as SSE_Place_Log0
    unpcklpd XMM2 XMM2 | mulpd XMM2 X$SSE_Log1020 | addpd XMM2 XMM3 | addpd XMM4 XMM2

    movupd XMM1 X$Log_Coeff0
    movupd XMM2 XMM0 | mulpd XMM2 XMM2 | mulsd XMM2 XMM2 | mulsd XMM2 XMM0
    mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1

    mulpd XMM1 XMM0
    addpd XMM1 X$SSE_Log10Var2 | mulpd XMM1 XMM2

    movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0

    ;--------------------
    ; All the necessary data is now held as packed-double pairs in xmm4, xmm1 and xmm2. Modified from the original version; with the shuffle it is faster
    ; We only need to sum all of them
    addpd xmm1 xmm4 ; add both doubles of xmm4 to xmm1. xmm1 = xmm1+xmm4
    addpd xmm1 xmm2 ; add both doubles of xmm2 to the result above. xmm1 = xmm1+xmm4+xmm2
    movupd xmm0 xmm1 ; Now the two doubles of xmm1 must be added to each other. To do that we copy xmm1 to xmm0
    pshufd XMM0 XMM0 SSE_SWAP_QWORDS ; and swap its two doubles, so xmm0 and xmm1 hold the pair in opposite order. SSE_SWAP_QWORDS = 78
    addpd xmm0 xmm1 ; and now we only need to add them. Both doubles of xmm0 hold the complete sum

The math behind this is the formula below ("Lo" and "Hi" are the low and high doubles of each register or variable):

Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)

Final xmm0 = Hiquadxmm0 + Loquadxmm0
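
To see how the instruction sequence maps onto this formula, here is a minimal scalar sketch in Python that models each XMM register as a `[low, high]` pair. The helper names (`mulpd`, `addpd`, `mulsd`, `original_sequence`, `formula`) and any values passed in are illustrative assumptions, not the real constants of the routine. One detail worth noting: `mulsd` changes only the low double, so in this model the fifth-power lane is index 0; the Hi/Lo labels in the formula above may follow a different naming convention.

```python
# Scalar model of the packed-double sequence; a register is a [low, high] pair.
# Function names and arguments are hypothetical, chosen only for illustration.

def mulpd(a, b):
    """Packed multiply: both doubles."""
    return [a[0] * b[0], a[1] * b[1]]

def addpd(a, b):
    """Packed add: both doubles."""
    return [a[0] + b[0], a[1] + b[1]]

def mulsd(a, b):
    """Scalar multiply: only the low double changes, the high double of a is kept."""
    return [a[0] * b[0], a[1]]

def original_sequence(x, acc, c0, v1, v2, v3):
    """x models xmm0, acc models xmm4, c0/v1/v2/v3 model the constant pairs."""
    xmm2 = mulpd(x, x)                    # x^2 in both lanes
    xmm2 = mulsd(xmm2, xmm2)              # low lane becomes x^4
    xmm2 = mulsd(xmm2, x)                 # low lane becomes x^5
    xmm1 = addpd(mulpd(c0, x), v1)        # c0*x + v1
    xmm1 = addpd(mulpd(xmm1, x), v2)      # (c0*x + v1)*x + v2
    xmm1 = mulpd(xmm1, xmm2)              # times x^5 (low) and x^2 (high)
    xmm2 = mulpd(v3, x)                   # v3*x
    xmm1 = addpd(addpd(xmm1, acc), xmm2)  # add the accumulator and the linear term
    return xmm1[0] + xmm1[1]              # copy, swap, add: horizontal sum

def formula(x, acc, c0, v1, v2, v3):
    """The two-lane formula written out directly."""
    def lane(i, power):
        poly = (c0[i] * x[i] + v1[i]) * x[i] + v2[i]
        return x[i] ** power * poly + acc[i] + v3[i] * x[i]
    return lane(0, 5) + lane(1, 2)
```

Both functions should agree up to rounding for any inputs, which gives a quick way to check a rearranged instruction sequence against the intended math before timing it.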

I'm trying to recreate the above equations using fewer registers (forcing it to use only xmm0, xmm1 and xmm2) while keeping the speed and shortening the computation, but it is hard to find the proper way.

I tried the code below, but the result is incorrect. I'm missing something.

Code:

;---------------------------- This is good for speed, but I'm missing some calculation, because it now produces an incorrect value.

    ;   1st step xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
    movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0 | addpd xmm2 xmm4
    ;   ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
    movupd XMM1 X$Log_Coeff0 | mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var2
    ;   since xmm0.Hi and xmm0.Lo are part of the same xmm0, we can simply multiply by it to get
    ;   the Lo and Hi quads of xmm0. For the Lo quad we must then multiply 3 more times to get xmm0.Hi^5
    mulpd XMM1 XMM0 | mulpd XMM1 XMM0
    ;   ok, now we have the Hi part xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
    ;   To get the Lo part we do:
    mulsd xmm1 xmm0 | mulsd xmm1 xmm0 | mulsd xmm1 xmm0
    ;   Now we can finally add xmm2 to xmm1
    addpd xmm1 xmm2
    ;   And finally we exchange the data to add both hi and lo quads
    movupd xmm0 xmm1 ; Now the two doubles of xmm1 must be added to each other. To do that we copy xmm1 to xmm0
    pshufd XMM0 XMM0 SSE_SWAP_QWORDS ; and swap its two doubles, so xmm0 and xmm1 hold the pair in opposite order
    addpd xmm0 xmm1 ; and now we only need to add them. Both doubles of xmm0 hold the complete sum
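
Both listings end with the same copy, `pshufd`, add idiom. As a small illustration of why the immediate 78 swaps the two doubles, the Python sketch below decodes the `pshufd` immediate the way the instruction is documented to work: each 2-bit field selects the source dword for one destination dword. The function name `pshufd` and the lane labels are only illustrative.

```python
def pshufd(src, imm8):
    """Model of pshufd: src is a list of four dwords, index 0 = least significant.
    Destination dword i is taken from the source dword selected by bits
    (2*i+1 .. 2*i) of the immediate."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]

SSE_SWAP_QWORDS = 78  # binary 01 00 11 10

# Label the four dwords by the double they belong to.
lanes = ["lo.d0", "lo.d1", "hi.d0", "hi.d1"]
swapped = pshufd(lanes, SSE_SWAP_QWORDS)
```

Here `swapped` holds the two dwords of the high double first and the two dwords of the low double last, so adding the shuffled copy to the original with `addpd` leaves the sum of both doubles in each lane.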

Siekmanski · August 17, 2020, 07:27:53 PM

Hi Guga,

You can create a reciprocal with these SSE instructions:
rcpps - Approximates the reciprocal of 4 packed floats.
rcpss - Approximates the reciprocal of a single float.

guga · August 17, 2020, 08:04:02 PM

Hi marinus.

Great

But how do I apply it to the previous code so that it uses fewer registers and becomes shorter and faster? (My version does not use divisions.)

I'm trying to simplify this even further:

Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
Final xmm0 = Hiquadxmm0 + Loquadxmm0

But I haven't succeeded yet.

Siekmanski · August 17, 2020, 08:20:41 PM

Just noticed you use real8.

Unfortunately, there are no rcppd and rcpsd instructions.
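
For readers who do need a double-precision reciprocal without a division, one common workaround (not something used in guga's routine, which has no divisions) is to take a low-precision estimate and refine it with Newton-Raphson steps, r = r * (2 - d * r). The Python sketch below only models the idea: `rough_reciprocal` is a hypothetical stand-in for an `rcpps`-style estimate with roughly 12 good bits, and the step count is an assumption.

```python
import math

def rough_reciprocal(d, bits=12):
    """Hypothetical stand-in for an rcpps-style estimate:
    1/d with its mantissa rounded to `bits` bits."""
    mantissa, exponent = math.frexp(1.0 / d)
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def refined_reciprocal(d, steps=3):
    """Refine the rough estimate; each Newton-Raphson step roughly
    doubles the number of correct bits (12 -> 24 -> 48 -> full double)."""
    r = rough_reciprocal(d)
    for _ in range(steps):
        r = r * (2.0 - d * r)
    return r
```

Each step costs two multiplications and one subtraction, so whether this beats a plain `divsd` depends on the CPU and would have to be timed.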

guga · August 17, 2020, 08:50:53 PM

Ok, I think i got it working as expected.

I had to do this:

; The formula to retrieve xmm0 is as follows:
; Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
; Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
; Final xmm0 = Hiquadxmm0 + Loquadxmm0

; 1st step: calculate the low and hi values of xmm4.Hi + (Log10Var3.Hi*xmm0.Hi) and xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0 | addpd xmm2 xmm4 ; Save the result in xmm2

; Now, what do both formulas have in common? They have this in common:
; xmm0.Lo * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
; xmm0.Hi * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
; So we compute this only once and keep it in xmm1 to get both the low and hi values

; ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
movupd XMM1 X$Log_Coeff0 | mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1 | mulpd xmm1 xmm0 | addpd XMM1 X$SSE_Log10Var2

; now we get the lo quad xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
; and the hi quad xmm0.Hi^2 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
; at once (each double is multiplied by the square of its own xmm0 value)
mulpd xmm1 xmm0 | mulpd xmm1 xmm0 ; the square is applied to both the low and hi doubles of xmm1

; let's now do the same for xmm0.Hi^5
; since the square is already applied, we only need to multiply 3 more times by xmm0 to get the value of Loquadxmm0
mulsd xmm1 xmm0 | mulsd xmm1 xmm0 | mulsd xmm1 xmm0

; Now we can finally add xmm2 to xmm1
addpd xmm1 xmm2

; And finally we exchange the data to add both hi and lo quads

movupd xmm0 xmm1 ; Now the two doubles of xmm1 must be added to each other. To do that we copy xmm1 to xmm0
pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; and swap its two doubles, so xmm0 and xmm1 hold the pair in opposite order. SSE_SWAP_QWORDS is a simple equate whose value is 78, which makes pshufd perform the swap
addpd xmm0 xmm1 ; and now we only need to add them. Both doubles of xmm0 hold the complete sum

Although I removed the extra registers in this part of the code, it still needs 17 instructions to calculate the final result in xmm0.

I wonder if this can be optimized even further in SSE2.
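
Comparing the two listings, the earlier attempt loads Log_Coeff0, multiplies by xmm0 and adds SSE_Log10Var2 directly, while the working version first adds SSE_Log10Var1 and multiplies by xmm0 once more. The Python sketch below models both variants on `[low, high]` pairs to show that this one skipped Horner step is enough to change the result. All function names and inputs are illustrative assumptions, not the real constants.

```python
# Scalar model on [low, high] pairs; names and inputs are hypothetical.

def mulpd(a, b):
    return [a[0] * b[0], a[1] * b[1]]

def addpd(a, b):
    return [a[0] + b[0], a[1] + b[1]]

def mulsd(a, b):
    # scalar form: only the low double changes
    return [a[0] * b[0], a[1]]

def finish(xmm1, x, acc, v3):
    """Steps shared by both variants after the polynomial is in xmm1."""
    xmm2 = addpd(mulpd(v3, x), acc)       # v3*x + accumulator
    xmm1 = mulpd(mulpd(xmm1, x), x)       # times x^2 in both lanes
    for _ in range(3):
        xmm1 = mulsd(xmm1, x)             # low lane ends up with x^5
    xmm1 = addpd(xmm1, xmm2)
    return xmm1[0] + xmm1[1]              # copy, swap, add

def failed_attempt(x, acc, c0, v1, v2, v3):
    xmm1 = addpd(mulpd(c0, x), v2)        # c0*x + v2: the v1 step is skipped
    return finish(xmm1, x, acc, v3)

def working_version(x, acc, c0, v1, v2, v3):
    xmm1 = addpd(mulpd(c0, x), v1)        # c0*x + v1
    xmm1 = addpd(mulpd(xmm1, x), v2)      # (c0*x + v1)*x + v2
    return finish(xmm1, x, acc, v3)

def formula(x, acc, c0, v1, v2, v3):
    def lane(i, power):
        poly = (c0[i] * x[i] + v1[i]) * x[i] + v2[i]
        return x[i] ** power * poly + acc[i] + v3[i] * x[i]
    return lane(0, 5) + lane(1, 2)
```

With any inputs where SSE_Log10Var1 and xmm0 are nonzero, `working_version` should match `formula` up to rounding while `failed_attempt` does not.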

jj2007 · August 17, 2020, 09:17:56 PM

Quote from: guga on August 17, 2020, 05:20:27 PM
Hi JJ.
I think you used the older version. The new one produces the correct value:

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)

The error is in the last 2 digits only :) Your version shows fewer digits than mine. In RosAsm, the full number (18 digits) is 0.698970004336018857, and only the last 2 digits lose precision.

Please try FastLog10a.zip, which I posted a few comments earlier :)

The only changes necessary are useC=0 and in TestB, line 357:
SetFloat MyReal8=Log10(MyExpo)
instead of ExpXY(MyBaseB, MyExpo, MyReal8)

Code:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3157    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4795    cycles for 100 * MasmBasic Log10

3166    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4830    cycles for 100 * MasmBasic Log10

3145    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4772    cycles for 100 * MasmBasic Log10

3212    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4767    cycles for 100 * MasmBasic Log10

3145    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4765    cycles for 100 * MasmBasic Log10

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for MasmBasic Log10

We are close

As regards precision:

include \masm32\MasmBasic\MasmBasic.inc ; download
SetGlobals MyReal10:REAL10
Init
PrintLine "0.698970004336018804786 (expected)"
SetFloat MyReal10=Log10(5.0)
Print Str$(MyReal10)
EndOfCode

Code:

0.698970004336018804786 (expected)
0.6989700043360188048
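
To put the digit counts in this exchange into perspective, the Python sketch below compares the quoted values against the 21-digit reference using exact decimal arithmetic. It assumes the reference string printed above is correct to all shown digits; the names `TRUE_LOG10_5`, `HALF_ULP` and `error` are illustrative. For a REAL8 in the range 0.5 to 1 the spacing between neighbouring values is 2^-53, so no REAL8 result can be guaranteed closer than 2^-54 to the true value.

```python
from decimal import Decimal, getcontext

getcontext().prec = 40

# Reference value as quoted in the thread (assumed correct to all digits shown).
TRUE_LOG10_5 = Decimal("0.698970004336018804786")

# Half the spacing of REAL8 values in [0.5, 1): the best possible rounding error.
HALF_ULP = Decimal(2) ** -54

def error(text):
    """Absolute distance of a printed value from the reference."""
    return abs(Decimal(text) - TRUE_LOG10_5)

real8_18_digits = error("0.698970004336018857")    # 18-digit RosAsm output
real10_output = error("0.6989700043360188048")     # REAL10 MasmBasic output

# Distance of the REAL8 value nearest to the reference, for comparison.
nearest_real8_error = abs(Decimal(float(TRUE_LOG10_5)) - TRUE_LOG10_5)
```

The 18-digit REAL8 output lies within half a REAL8 spacing of the reference, which is consistent with the last two printed digits being rounding noise of the format rather than an error in the algorithm.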

guga · August 17, 2020, 09:47:28 PM

Great! Thanks, JJ. I'll change the code and compare it to yours. I'm trying to see if I can optimize a bit further before posting a newer version :)

guga · August 19, 2020, 12:46:13 AM

New version.

It is a bit faster (around 10%).

I think I have reached my limit of optimization. If someone wants to give it a try and optimize further, please do.

Accuracy not affected by the current optimization.

Precision is:
16 digits after the "." for normalized values
12 digits after the "." for denormalized values

My timings:

Code:

AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

2640    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6699    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2507    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6765    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2471    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6774    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2516    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6785    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2569    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6600    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for Log10 (MasmBasic, Log10 of 5)

--- ok ---

The updated version is also in the 1st post.

jj2007 · August 19, 2020, 01:06:03 AM

Code:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2867    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4731    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2860    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4780    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2834    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4775    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2878    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4760    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2858    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4742    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for Log10 (MasmBasic, Log10 of 5)

guga · August 19, 2020, 02:04:22 AM

Great !

A small improvement:
replace the cmp ecx, 1023 with a test ecx, (not 1023), like this:

Code:

		test ecx, 0FFFFFC00h
		 jz     loc_4C6BFC
;                 cmp     ecx, 1023; Check for special cases (NAN, INF etc). Otherwise, jmp to the start of computation
;                 jbe     loc_4C6BFC

It should give another 2 or 3% of speed.

Btw, take a look at the source. I commented it a little and also changed the tables to hold the proper values (Log10_Table_T and SSE_Log1020). Now the 2nd Real8 in both tables can be zeroed. Later this could be improved a bit more by using an array of 2 Real8 instead of 4 Real8 values in the tables Log10_Table_T and Log10_Table_B, and perhaps by using mulsd in some parts of the code rather than mulpd. I haven't checked this part yet, but I guess that it is fast and precise enough already :)

The Log10_Table_T table holds log10(x) for x from 1 to 2.

I haven't found out yet how those x values were generated, but the table was calculated as follows:

Code:

; Log10_Table_T is formed by a log10(x) where x is the same as below:
; dq log10(1), log10(1), log10(1.01587301587301580), log10(1)
; dq log10(1.03121852970795570), log10(1),  log10(1.04703476482617600), log10(1)
; dq log10(1.06224066390041490), log10(1),  log10(1.07789473684210520), log10(1)
; dq log10(1.09401709401709410), log10(1),  log10(1.10942578548212340), log10(1)
; dq log10(1.12527472527472530), log10(1),  log10(1.14031180400890860), log10(1)
; dq log10(1.15640880858272150), log10(1),  log10(1.17162471395881010), log10(1)
; dq log10(1.18724637681159420), log10(1),  log10(1.20329024676850760), log10(1)
; dq log10(1.21904761904761890), log10(1),  log10(1.23447860156720910), log10(1)
; dq log10(1.25030525030525030), log10(1),  log10(1.26576019777503080), log10(1)
; dq log10(1.28160200250312890), log10(1),  log10(1.29702343255224830), log10(1)
; dq log10(1.31282051282051280), log10(1),  log10(1.32814526588845650), log10(1)
; dq log10(1.34383202099737530), log10(1),  log10(1.35899137358991370), log10(1)
; dq log10(1.37541974479516460), log10(1),  log10(1.39035980991174470), log10(1)
; dq log10(1.40659340659340670), log10(1),  log10(1.42222222222222210), log10(1)
; dq log10(1.43719298245614050), log10(1),  log10(1.45351312987934710), log10(1)
; dq log10(1.46915351506456250), log10(1),  log10(1.48405797101449280), log10(1)
; dq log10(1.50036630036630040), log10(1),  log10(1.51591413767579560), log10(1)
; dq log10(1.53178758414360510), log10(1),  log10(1.54682779456193350), log10(1)
; dq log10(1.56216628527841350), log10(1),  log10(1.57781201848998460), log10(1)
; dq log10(1.59377431906614800), log10(1),  log10(1.60879811468970920), log10(1)
; dq log10(1.62539682539682540), log10(1),  log10(1.64102564102564120), log10(1)
; dq log10(1.65561843168957170), log10(1),  log10(1.67183673469387760), log10(1)
; dq log10(1.68698517298187810), log10(1),  log10(1.70382695507487520), log10(1)
; dq log10(1.71812080536912770), log10(1),  log10(1.73412362404741740), log10(1)
; dq log10(1.75042735042735040), log10(1),  log10(1.76551724137931030), log10(1)
; dq log10(1.78086956521739140), log10(1),  log10(1.79649122807017550), log10(1)
; dq log10(1.81238938053097340), log10(1),  log10(1.82857142857142850), log10(1)
; dq log10(1.84338433843384330), log10(1),  log10(1.86012715712988190), log10(1)
; dq log10(1.87545787545787520), log10(1),  log10(1.89104339796860570), log10(1)
; dq log10(1.90689013035381750), log10(1),  log10(1.92120075046904320), log10(1)
; dq log10(1.93755912961210970), log10(1),  log10(1.95233555767397520), log10(1)
; dq log10(1.96923076923076930), log10(1),  log10(1.98449612403100770), log10(1)
; dq log10(2), log10(1)

But how values such as 1.01587301587301580, 1.87545787545787520 and 1.98449612403100770 were generated, I have no idea yet.
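
One pattern that appears to reproduce the 65 listed arguments, found by numerical inspection and not confirmed by the original source of the table: entry i (for i = 0 to 64) equals 2048 / k, where k is 131072 / (64 + i) rounded to the nearest integer. In other words, k / 2048 looks like the reciprocal of 1 + i/64 rounded to 11 fractional bits, and the table stores log10 of the inverse of that rounded reciprocal. The Python sketch below regenerates the arguments under this assumption; the function names are illustrative.

```python
import math

def table_argument(i):
    """Assumed generator for the i-th argument of Log10_Table_T (i = 0..64):
    the inverse of 1/(1 + i/64) after rounding it to a multiple of 1/2048."""
    k = round(131072 / (64 + i))   # 131072 = 2**17
    return 2048 / k

def table_entry(i):
    """log10 of the assumed argument, i.e. the value stored in the table."""
    return math.log10(table_argument(i))

arguments = [table_argument(i) for i in range(65)]
```

For example, i = 1 gives k = 2016 and 2048/2016 = 1.0158730158..., and i = 56 gives k = 1092 and 2048/1092 = 1.8754578754..., matching the values asked about above.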

daydreamer · August 19, 2020, 09:51:59 AM

What about checking cpuid to see whether the CPU has SSE3, and having a version with a horizontal add, HADDPD, instead of
movupd | shufps addps
Would that be a few cycles faster?

guga · August 19, 2020, 11:34:08 AM

Hi DayDreamer. It would probably be faster, but I have no way to assemble SSE3 or SSE4 code. I haven't had time to update RosAsm to work with those yet. That's why I'm using SSE2 and trying to get the maximum speed out of it (it could also help others who don't have newer processors).

daydreamer · August 20, 2020, 04:47:00 AM

How are the macro capabilities of RosAsm? Is it possible to make SSE3 and SSE4 macros?
http://www.masmforum.com/board/index.php?topic=973.0

guga · August 20, 2020, 07:38:43 AM

You mean using something like this ?

Code:

I'm having a little trouble including these macros.

This one doesn't look right:
Code:
ORPD MACRO M1,M2
    DB 066H
    ORPS MACRO M1,M2
ENDM

After fixing that, it hates these two:

Code:
CMPLTSD MACRO M1,M2
    DB 0F2H
    CMPLTPS M1,M2
END
and
Code:
CMPSD MACRO M1,M2,M3
    DB 0F2H
    CMPPS M1,M2,M3
ENDM

Never tried creating a pseudo-instruction as a macro by hand before.

In RosAsm we could hard code it with something like: DB 01 025 070

The problem is that it will be extremely hard to follow. It would be better to implement SSE3 and SSE4 in RosAsm, and I have to do it eventually. I just can't implement it right now due to lack of time, and there are several things to fix in RosAsm before implementing SSE3/SSE4. I still need to find some time to detach RosAsm's internal code and create DLLs for the main tools, such as the disassembler, debugger, resources editor, forms creator and even the encoder. All of these would be better off in their own DLLs rather than in a single monolithic source as it is now.

Macros in RosAsm work inside "[" and "]". The 1st bracket must be immediately followed by a separator "|". Like this:

[HIWORD | mov eax #1 | shr eax 16]

or the normal If Chain.

[If | cmp #1 #3 | jn#2 I1>]
[Else_if | jmp I9> | I1: | cmp #1 #3 | jn#2 I1>]
[Else | Jmp I9> | I1:]
[End_if | I1: | I9:]

Of course, this is not a rigid syntax. It's just the default macro set; the user can choose whether to use it or even write his own macro set.

Siekmanski · August 20, 2020, 08:13:58 AM

If you need a fast SSE2 horizontal addition for 4 packed floats,

Code:

    movaps  xmm1,xmm0
    shufps  xmm1,xmm0,10110001b
    addps   xmm0,xmm1
    movhlps xmm1,xmm0
    addss   xmm0,xmm1
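
To follow what each of the five instructions does, here is a minimal Python sketch that models the four floats as a list with index 0 as the lowest lane. The helper functions mirror the documented behaviour of `shufps`, `addps`, `movhlps` and `addss`; the name `horizontal_add` is illustrative.

```python
def shufps(dst, src, imm8):
    """Low two result lanes are picked from dst, high two from src."""
    return [dst[imm8 & 3], dst[(imm8 >> 2) & 3],
            src[(imm8 >> 4) & 3], src[(imm8 >> 6) & 3]]

def addps(a, b):
    """Packed add of all four lanes."""
    return [p + q for p, q in zip(a, b)]

def movhlps(dst, src):
    """Copy the high two lanes of src into the low two lanes of dst."""
    return [src[2], src[3], dst[2], dst[3]]

def addss(a, b):
    """Scalar add: only lane 0 changes."""
    return [a[0] + b[0], a[1], a[2], a[3]]

def horizontal_add(v):
    xmm0 = list(v)
    xmm1 = list(xmm0)                       # movaps  xmm1,xmm0
    xmm1 = shufps(xmm1, xmm0, 0b10110001)   # xmm1 = [a1, a0, a3, a2]
    xmm0 = addps(xmm0, xmm1)                # [a0+a1, a0+a1, a2+a3, a2+a3]
    xmm1 = movhlps(xmm1, xmm0)              # low lanes of xmm1 = a2+a3
    xmm0 = addss(xmm0, xmm1)                # lane 0 = a0+a1+a2+a3
    return xmm0[0]
```

The total ends up in the lowest lane only; the other three lanes hold partial sums.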

The MASM Forum
