Fast Log10 approximation

Started by guga, August 16, 2020, 06:01:01 AM

guga

#30
Quote from: jj2007 on August 17, 2020, 10:20:09 AM
Quote from: HSE on August 16, 2020, 07:47:28 AM
:thumbsup:

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

5396    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
11737   cycles for 100 * SmplMath:  fSlv MyReal8 = log(MyExpo)

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3163    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
5060    cycles for 100 * Log10

3211    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4933    cycles for 100 * Log10

3209    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4823    cycles for 100 * Log10

3186    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4773    cycles for 100 * Log10

3219    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise )
4739    cycles for 100 * Log10

536     bytes for Sse2_log10_precise (Guga SSE2 Log10 precise )
16      bytes for Log10

0.699076046207356 for Sse2_log10_precise (Guga SSE2 Log10 precise )
0.698970004336019 for Log10
0.6989700043360188 expected


Hi JJ.
I think you used the older version. The new one produces the correct value:

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)

The error is in the last 2 digits only :) Your version prints fewer digits than mine. In RosAsm, the full number (18 digits) is 0.698970004336018857, and only the last 2 digits lose precision.

Please try with the FastLog10a.zip I posted a few comments earlier :)

I managed to optimize a bit further, but I'm still trying to work out the proper math to optimize it even more and avoid using so many registers. The part I need to optimize now is this one (the modification below also produces the correct result and gained extra speed):


(...)
     ; same as SSE_Place_Log0
    unpcklpd XMM2 XMM2 | mulpd XMM2 X$SSE_Log1020 | addpd XMM2 XMM3 | addpd XMM4 XMM2

    movupd XMM1 X$Log_Coeff0
    movupd XMM2 XMM0 | mulpd XMM2 XMM2 | mulsd XMM2 XMM2 | mulsd XMM2 XMM0
    mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1

    mulpd XMM1 XMM0
    addpd XMM1 X$SSE_Log10Var2 | mulpd XMM1 XMM2

    movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0

    ;--------------------
    ; All the necessary data is now held as packed doubles in registers xmm4, xmm1 and xmm2. Modified original version. With the shuffle it is faster
    ; We only need to sum all of them
    addpd xmm1 xmm4 ; add both doubles of xmm4 to xmm1. xmm1 = xmm1+xmm4
    addpd xmm1 xmm2 ; add both doubles of xmm2 to the result above. xmm1 = xmm1+xmm4+xmm2
    movupd xmm0 xmm1 ; Now we need to add the two doubles held in xmm1 to each other. To do that we copy xmm1 to xmm0
    pshufd XMM0 XMM0 SSE_SWAP_QWORDS ; and swap its two doubles, so xmm0 holds the pair in the reverse order of xmm1. SSE_SWAP_QWORDS = 78
    addpd xmm0 xmm1 ; and now we only need to add them. Both doubles of xmm0 = low double + high double of (xmm1+xmm4+xmm2)
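
For anyone wondering why the value 78 swaps the two doubles: pshufd picks each destination dword with a 2-bit field of the immediate, and 78 is 01_00_11_10 in binary. The snippet below is only a small Python model of that selection rule (the lane labels are made up for illustration), not part of the RosAsm source:

```python
def pshufd(src, imm8):
    """Model of SSE2 PSHUFD: src is a list of four 32-bit lanes (index 0 = lowest).

    Destination lane i takes the source lane selected by bits 2*i+1..2*i of imm8.
    """
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]

SSE_SWAP_QWORDS = 78  # 01_00_11_10 in binary, value quoted in the post

# Hypothetical labels for the four dwords of a register holding two doubles.
reg = ["lo.d0", "lo.d1", "hi.d0", "hi.d1"]
swapped = pshufd(reg, SSE_SWAP_QWORDS)
print(swapped)  # ['hi.d0', 'hi.d1', 'lo.d0', 'lo.d1']
```

So dwords 2 and 3 move to the bottom and dwords 0 and 1 move to the top, which is exactly a swap of the two 64-bit halves.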



The math behind this is the formula below ("Lo" and "Hi" are the low and high doubles of each register or variable):

Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)

Final xmm0 = Hiquadxmm0 + Loquadxmm0

I'm trying to recreate the above equations using fewer registers (forcing it to use only xmm0, xmm1 and xmm2) while keeping the speed and shortening the computation, but it is hard to find the proper way.

I tried the version below, but the result is incorrect. I'm missing something.


;---------------------------- This is good for speed, but I'm missing some calculation, because it now produces an incorrect value.

    ;   1st step xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
    movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0 | addpd xmm2 xmm4
    ;   ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
    movupd XMM1 X$Log_Coeff0 | mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var2
    ;   since xmm0.Hi and xmm0.Lo are part of the same xmm0, we can simply multiply by it to get
    ;   the Lo and Hi quads. For the Lo quad we then simply multiply by the low quad 3 more times to get xmm0.Hi^5
    mulpd XMM1 XMM0 | mulpd XMM1 XMM0
    ;   ok, now we have the Hi part xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
    ;   To get the Lo part we do:
    mulsd xmm1 xmm0 | mulsd xmm1 xmm0 | mulsd xmm1 xmm0
    ;   Now we can finally add xmm2 to xmm1
    addpd xmm1 xmm2
    ;   And finally we exchange the data to add both the hi and lo quads
    movupd xmm0 xmm1 ; Now we need to add the two doubles held in xmm1 to each other. To do that we copy xmm1 to xmm0
    pshufd XMM0 XMM0 SSE_SWAP_QWORDS ; and swap its two doubles, so xmm0 holds the pair in the reverse order of xmm1
    addpd xmm0 xmm1 ; and now we only need to add them. Both doubles of xmm0 = low double + high double of xmm1
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

Hi Guga,

You can create a reciprocal with these SSE instructions:
rcpps - Approximates the reciprocals of 4 packed floats.
rcpss - Approximates the reciprocal of a single float.
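
As background, these instructions only give roughly 12 good bits (Intel documents a relative error of at most 1.5 * 2^-12), so they are usually followed by one or more Newton-Raphson steps, r <- r * (2 - x*r). The Python sketch below only imitates that idea: `approx_rcp` is a made-up stand-in that rounds 1/x to 12 mantissa bits, it is not what the hardware really computes.

```python
import math

def approx_rcp(x, bits=12):
    """Hypothetical stand-in for rcpss: 1/x with the mantissa rounded to `bits` bits."""
    m, e = math.frexp(1.0 / x)           # 1/x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)

def refine(x, r, steps):
    """Newton-Raphson for the reciprocal: each step roughly doubles the correct bits."""
    for _ in range(steps):
        r = r * (2.0 - x * r)
    return r

x = 5.0
r0 = approx_rcp(x)
for steps in range(4):
    r = refine(x, r0, steps)
    print(steps, r, abs(r * x - 1.0))    # refinement steps, estimate, relative error
```

Each step costs two multiplications and one subtraction, so whether this beats a plain division depends on the precision needed.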
Creative coders use backward thinking techniques as a strategy.

guga

Hi marinus.

Great  :thumbsup: :thumbsup: :thumbsup: :thumbsup:

But how can it be applied to the previous code to reduce the number of registers used and make it shorter and faster? (My version does not use divisions.)

I'm trying to simplify this even further:

Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
Final xmm0 = Hiquadxmm0 + Loquadxmm0

But I haven't succeeded yet.

Siekmanski

Just noticed you use real8.

Unfortunately, there are no rcppd and rcpsd instructions.

guga

Ok, I think i got it working as expected.

I had to do this:

;    The formula to retrieve xmm0 is as follows:
;        Hiquadxmm0 = xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo) + xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
;        Loquadxmm0 = xmm0.Hi^5 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi) + xmm4.Hi + (Log10Var3.Hi*xmm0.Hi)
;        Final xmm0 = Hiquadxmm0 + Loquadxmm0

    ;   1st step: we calculate the low and high values of xmm4.Hi + (Log10Var3.Hi*xmm0.Hi) and xmm4.Lo + (Log10Var3.Lo*xmm0.Lo)
    movupd xmm2 X$SSE_Log10Var3 | mulpd XMM2 XMM0 | addpd xmm2 xmm4 ; Save the result in xmm2

    ; Now, what do both have in common? Both share this:
    ; xmm0.Lo * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
    ; xmm0.Hi * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
    ; So we compute this only once and keep it in xmm1 to get both the low and high values

    ;   ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
    movupd XMM1 X$Log_Coeff0 | mulpd XMM1 XMM0 | addpd XMM1 X$SSE_Log10Var1 | mulpd xmm1 xmm0 | addpd XMM1 X$SSE_Log10Var2

    ; now we get the Lo quad xmm0.Lo^2 * ( (Log_Coeff0.Lo*xmm0.Lo+SSE_Log10Var1.Lo)* xmm0.Lo+ SSE_Log10Var2.Lo)
    ; and the Hi quad xmm0.Hi^2 * ( (Log_Coeff0.Hi*xmm0.Hi+SSE_Log10Var1.Hi)* xmm0.Hi+ SSE_Log10Var2.Hi)
    ; at once (because both need the square of their own half of xmm0)
    mulpd xmm1 xmm0 | mulpd xmm1 xmm0; the square of xmm0 is applied to both the low and high doubles of xmm1

    ; let's now do the same for xmm0.Hi^5
    ; since the square is already applied, we only need to multiply 3 more times by xmm0 to get the value of Loquadxmm0 (mulsd changes only the low double)
    mulsd xmm1 xmm0 | mulsd xmm1 xmm0 | mulsd xmm1 xmm0

    ;   Now we can finally add xmm2 to xmm1
    addpd xmm1 xmm2

    ;   And finally we exchange the data to add both the hi and lo quads

    movupd xmm0 xmm1 ; Now we need to add the two doubles held in xmm1 to each other. To do that we copy xmm1 to xmm0
    pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; and swap its two doubles, so xmm0 holds the pair in the reverse order of xmm1. SSE_SWAP_QWORDS is a simple equate whose value is 78, which makes pshufd perform the swap
    addpd xmm0 xmm1 ; and now we only need to add them. Both doubles of xmm0 = low double + high double of xmm1


Although I removed the extra registers in this part of the code, it still needs 17 instructions to calculate the final result in xmm0.

I wonder if this can be optimized even further in SSE2.
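
To double-check the rewrite without an assembler at hand, the three instruction sequences from this thread can be modelled lane by lane. The Python sketch below treats each XMM register as a (lo, hi) tuple; the input values are arbitrary test numbers picked so the arithmetic is exact, they are not the real table constants. It suggests that the 3-register version matches the original, and that the earlier failed attempt differs because the SSE_Log10Var1 step (the `addpd ... X$SSE_Log10Var1 | mulpd xmm1 xmm0` pair) was left out.

```python
# Packed doubles are modelled as (lo, hi) tuples.
def mulpd(a, b): return (a[0] * b[0], a[1] * b[1])
def addpd(a, b): return (a[0] + b[0], a[1] + b[1])
def mulsd(a, b): return (a[0] * b[0], a[1])       # only the low lane changes
def swap(a):     return (a[1], a[0])              # pshufd reg reg 78

def hsum(a):
    """movupd + pshufd + addpd: both lanes end up holding lo + hi."""
    return addpd(swap(a), a)

def original(x, acc, c0, v1, v2, v3):
    pw = mulpd(x, x); pw = mulsd(pw, pw); pw = mulsd(pw, x)   # (x.lo^5, x.hi^2)
    p = addpd(mulpd(c0, x), v1)
    p = mulpd(addpd(mulpd(p, x), v2), pw)
    t = mulpd(v3, x)
    return hsum(addpd(addpd(p, acc), t))

def optimized(x, acc, c0, v1, v2, v3):
    t = addpd(mulpd(v3, x), acc)
    p = addpd(mulpd(addpd(mulpd(c0, x), v1), x), v2)
    p = mulpd(mulpd(p, x), x)
    for _ in range(3):
        p = mulsd(p, x)
    return hsum(addpd(p, t))

def failed(x, acc, c0, v1, v2, v3):
    t = addpd(mulpd(v3, x), acc)
    p = addpd(mulpd(c0, x), v2)        # the v1 add and one multiply are missing here
    p = mulpd(mulpd(p, x), x)
    for _ in range(3):
        p = mulsd(p, x)
    return hsum(addpd(p, t))

# Arbitrary, exactly representable test inputs (not the real coefficients).
args = ((0.5, 0.25), (1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0), (9.0, 10.0))
print(original(*args), optimized(*args), failed(*args))
```

With real coefficients the two working versions may differ in the last bit, because the powers of x are multiplied in a different order.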

jj2007

Quote from: guga on August 17, 2020, 05:20:27 PM
Hi JJ.
I think you used the older version. The new one produces the correct value:

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)

The error is on the last 2 digits only :) Your version have less, digits then mine. So, on RosAsm, the full number (18 digits) is: 0.698970004336018857 with only the last 2 digits that looses precision.

Please, try with FastLog10a.zip i posted a few comments earlier :)

The only changes necessary are useC=0 and in TestB, line 357:
   SetFloat MyReal8=Log10(MyExpo)
instead of ExpXY(MyBaseB, MyExpo, MyReal8)

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3157    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4795    cycles for 100 * MasmBasic Log10

3166    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4830    cycles for 100 * MasmBasic Log10

3145    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4772    cycles for 100 * MasmBasic Log10

3212    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4767    cycles for 100 * MasmBasic Log10

3145    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4765    cycles for 100 * MasmBasic Log10

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for MasmBasic Log10


We are close :smiley:

As regards precision:

include \masm32\MasmBasic\MasmBasic.inc         ; download
  SetGlobals MyReal10:REAL10
  Init
  PrintLine "0.698970004336018804786 (expected)"
  SetFloat MyReal10=Log10(5.0)
  Print Str$(MyReal10)
EndOfCode


0.698970004336018804786 (expected)
0.6989700043360188048

guga

Quote from: jj2007 on August 17, 2020, 09:17:56 PM
Great! Thanks, JJ. I'll change the code and compare it to yours. I'm trying to see if I can optimize a bit further before posting a newer version :)


guga

New version.

A bit faster (around 10%). :thumbsup:

I think I've reached my limit of optimization :bgrin: :bgrin:. If someone wants to give it a try and optimize further, please do :greensml: :greensml: :greensml:

Accuracy is not affected by the current optimization.

Precision is:
16 digits after the "." for normalized values
12 digits after the "." for denormalized values

My timings:


AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

2640    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6699    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2507    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6765    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2471    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6774    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2516    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6785    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2569    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
6600    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for Log10 (MasmBasic, Log10 of 5)

--- ok ---




Updated version also on the 1st post

jj2007

 :thumbsup:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2867    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4731    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2860    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4780    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2834    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4775    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2878    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4760    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

2858    cycles for 100 * Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
4742    cycles for 100 * Log10 (MasmBasic, Log10 of 5)

0.698970004336019 for Sse2_log10_precise (Guga SSE2 Log10 precise of 5)
0.698970004336019 for Log10 (MasmBasic, Log10 of 5)

guga

Great !

Small improvement.
Replace the cmp ecx, 1023 with a test of ecx against (not 1023), like this:

test ecx, 0FFFFFC00h
jz     loc_4C6BFC
;                 cmp     ecx, 1023; Check for special cases (NAN, INF etc). Otherwise, jmp to the start of computation
;                 jbe     loc_4C6BFC


It should speed things up by another 2 or 3% :bgrin:
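
A quick sanity check that both forms take the branch for the same inputs: `cmp ecx, 1023 / jbe` jumps when ecx <= 1023 (unsigned), while `test ecx, 0FFFFFC00h / jz` jumps when no bit above bit 9 is set, which is the same condition. The Python lines below just model the two flag tests on a sample of 32-bit values; the function names are made up for this sketch.

```python
MASK = 0xFFFFFC00  # (not 1023) as a 32-bit value

def jbe_after_cmp(ecx):
    """cmp ecx, 1023 / jbe: taken when ecx <= 1023, unsigned."""
    return ecx <= 1023

def jz_after_test(ecx):
    """test ecx, 0FFFFFC00h / jz: taken when (ecx AND mask) is zero."""
    return (ecx & MASK) == 0

samples = list(range(4096)) + [0x7FF, 0xFFFFFC00, 0x80000000, 0xFFFFFFFF]
same = all(jbe_after_cmp(v) == jz_after_test(v) for v in samples)
print("same branch decision on all samples:", same)
```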


Btw... take a look at the source. I commented it a little and also changed the tables to hold the proper values (Log10_Table_T and SSE_Log1020). Now the 2nd Real8 in both tables can be zeroed, and later this could be improved a bit more by using an array of 2 Real8 instead of 4 Real8 values in the tables Log10_Table_T and Log10_Table_B, and perhaps by using mulsd in some parts of the code rather than mulpd. I didn't check this part yet, but I guess it is fast and precise enough already :)

The Log10_Table_T table is formed by log10(x), where x goes from 1 to 2.

I haven't found out yet how those x values were generated, but the table was calculated as follows:

; Log10_Table_T is formed by a log10(x) where x is the same as below:
; dq log10(1), log10(1), log10(1.01587301587301580), log10(1)
; dq log10(1.03121852970795570), log10(1),  log10(1.04703476482617600), log10(1)
; dq log10(1.06224066390041490), log10(1),  log10(1.07789473684210520), log10(1)
; dq log10(1.09401709401709410), log10(1),  log10(1.10942578548212340), log10(1)
; dq log10(1.12527472527472530), log10(1),  log10(1.14031180400890860), log10(1)
; dq log10(1.15640880858272150), log10(1),  log10(1.17162471395881010), log10(1)
; dq log10(1.18724637681159420), log10(1),  log10(1.20329024676850760), log10(1)
; dq log10(1.21904761904761890), log10(1),  log10(1.23447860156720910), log10(1)
; dq log10(1.25030525030525030), log10(1),  log10(1.26576019777503080), log10(1)
; dq log10(1.28160200250312890), log10(1),  log10(1.29702343255224830), log10(1)
; dq log10(1.31282051282051280), log10(1),  log10(1.32814526588845650), log10(1)
; dq log10(1.34383202099737530), log10(1),  log10(1.35899137358991370), log10(1)
; dq log10(1.37541974479516460), log10(1),  log10(1.39035980991174470), log10(1)
; dq log10(1.40659340659340670), log10(1),  log10(1.42222222222222210), log10(1)
; dq log10(1.43719298245614050), log10(1),  log10(1.45351312987934710), log10(1)
; dq log10(1.46915351506456250), log10(1),  log10(1.48405797101449280), log10(1)
; dq log10(1.50036630036630040), log10(1),  log10(1.51591413767579560), log10(1)
; dq log10(1.53178758414360510), log10(1),  log10(1.54682779456193350), log10(1)
; dq log10(1.56216628527841350), log10(1),  log10(1.57781201848998460), log10(1)
; dq log10(1.59377431906614800), log10(1),  log10(1.60879811468970920), log10(1)
; dq log10(1.62539682539682540), log10(1),  log10(1.64102564102564120), log10(1)
; dq log10(1.65561843168957170), log10(1),  log10(1.67183673469387760), log10(1)
; dq log10(1.68698517298187810), log10(1),  log10(1.70382695507487520), log10(1)
; dq log10(1.71812080536912770), log10(1),  log10(1.73412362404741740), log10(1)
; dq log10(1.75042735042735040), log10(1),  log10(1.76551724137931030), log10(1)
; dq log10(1.78086956521739140), log10(1),  log10(1.79649122807017550), log10(1)
; dq log10(1.81238938053097340), log10(1),  log10(1.82857142857142850), log10(1)
; dq log10(1.84338433843384330), log10(1),  log10(1.86012715712988190), log10(1)
; dq log10(1.87545787545787520), log10(1),  log10(1.89104339796860570), log10(1)
; dq log10(1.90689013035381750), log10(1),  log10(1.92120075046904320), log10(1)
; dq log10(1.93755912961210970), log10(1),  log10(1.95233555767397520), log10(1)
; dq log10(1.96923076923076930), log10(1),  log10(1.98449612403100770), log10(1)
; dq log10(2), log10(1)


But how the values 1.01587301587301580, 1.87545787545787520 and 1.98449612403100770 were generated, I have no idea yet.
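
One observation that may help, offered only as a hint and not as an explanation of how the table was really built: those three numbers are, to double precision, simple ratios. The Python check below uses the standard `fractions` module to find the closest fraction with a small denominator. The results are 64/63, 512/273 and 256/129, i.e. 1024/1008, 1024/546 and 1024/516. That would fit a table built from reciprocals rounded to a few bits, but that part is a guess and I have not checked every entry.

```python
from fractions import Fraction

# The three values asked about in the post.
values = (1.01587301587301580, 1.87545787545787520, 1.98449612403100770)

for v in values:
    f = Fraction(v).limit_denominator(2048)   # closest fraction with denominator <= 2048
    print(v, "~", f, " error:", abs(float(f) - v))
```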

daydreamer

What about checking cpuid for SSE3 support and having a version that uses the horizontal add, HADDPD, instead of
movupd | shufps addps
Would that be a few cycles faster?
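
To make the comparison concrete: HADDPD puts dst.lo + dst.hi in the low double and src.lo + src.hi in the high double, so `haddpd xmm0, xmm0` gives the same sum in both lanes as the copy/swap/add sequence. The Python model below only shows that the results agree; it says nothing about which one is faster on a given CPU.

```python
def haddpd(dst, src):
    """SSE3 HADDPD: low = dst.lo + dst.hi, high = src.lo + src.hi."""
    return (dst[0] + dst[1], src[0] + src[1])

def swap_and_add(a):
    """SSE2 route used in this thread: copy, swap the two doubles, addpd."""
    swapped = (a[1], a[0])
    return (swapped[0] + a[0], swapped[1] + a[1])

v = (1.5, 2.25)                 # arbitrary (lo, hi) test pair
print(haddpd(v, v), swap_and_add(v))
```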
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

guga

Quote from: daydreamer on August 19, 2020, 09:51:59 AM
what about you check cpuid for SSE3 is on cpu and have a version with horizontal add,HADDPD instead of
movupd | shufps addps
would that be few cycles faster?

Hi DayDreamer. It would probably be faster, but I have no way to assemble SSE3 or SSE4 yet. I haven't had time to update RosAsm to work with those. That's why I'm using SSE2 and trying to get the maximum speed out of it (it could also help others who don't have newer processors).

daydreamer

Quote from: guga on August 19, 2020, 11:34:08 AM
Quote from: daydreamer on August 19, 2020, 09:51:59 AM
what about you check cpuid for SSE3 is on cpu and have a version with horizontal add,HADDPD instead of
movupd | shufps addps
would that be few cycles faster?

Hi DayDreamer. probably it would be faster, but i don´t have how to assemble in SSE3 or SSE4. I didn´t have time to updated RosAsm to work with those yet. That´s why i´m using SSE2 trying to get the maximum speed of it (also could help others that don´t have newer processors)
How are the macro capabilities of RosAsm? Is it possible to make SSE3 and SSE4 macros?
http://www.masmforum.com/board/index.php?topic=973.0

guga

Quote from: daydreamer on August 20, 2020, 04:47:00 AM
Quote from: guga on August 19, 2020, 11:34:08 AM
Quote from: daydreamer on August 19, 2020, 09:51:59 AM
what about you check cpuid for SSE3 is on cpu and have a version with horizontal add,HADDPD instead of
movupd | shufps addps
would that be few cycles faster?

Hi DayDreamer. probably it would be faster, but i don´t have how to assemble in SSE3 or SSE4. I didn´t have time to updated RosAsm to work with those yet. That´s why i´m using SSE2 trying to get the maximum speed of it (also could help others that don´t have newer processors)
hows the macro caps for rosasm?possible to make SSE3 and SSE4 macros?
http://www.masmforum.com/board/index.php?topic=973.0

You mean using something like this ?


I'm having a little trouble including these macros.

This one doesn't look right:
Code:
ORPD MACRO M1,M2
    DB 066H
    ORPS MACRO M1,M2
ENDM

After fixing that, it hates these two:

Code:
CMPLTSD MACRO M1,M2
    DB 0F2H
    CMPLTPS M1,M2
END
and
Code:
CMPSD MACRO M1,M2,M3
    DB 0F2H
    CMPPS M1,M2,M3
ENDM




I've never tried creating a pseudo-instruction as a macro by hand before.

In RosAsm we could hard code it with something like: DB  01  025  070
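
As an illustration of what such hard-coding involves, here is how the bytes for the register form of HADDPD could be worked out. The encoding is 66 0F 7C /r, and for two registers the ModRM byte is 11b in the top two bits, the destination in the middle three and the source in the low three. The helper below is a hypothetical Python sketch for xmm0..xmm7 only; it does not handle memory operands.

```python
def encode_haddpd(dst, src):
    """Bytes for HADDPD xmm<dst>, xmm<src> (register form, xmm0..xmm7 only)."""
    if not (0 <= dst <= 7 and 0 <= src <= 7):
        raise ValueError("this sketch only covers xmm0..xmm7")
    modrm = 0xC0 | (dst << 3) | src    # mod=11b, reg=dst, rm=src
    return bytes([0x66, 0x0F, 0x7C, modrm])

print(encode_haddpd(0, 0).hex(" "))   # 66 0f 7c c0
print(encode_haddpd(1, 0).hex(" "))   # 66 0f 7c c8
```

Each register pair needs its own byte sequence, which is part of why the result is hard to follow in source form.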

The problem is that it would be extremely hard to follow. It would be better to implement SSE3 and SSE4 in RosAsm. I have to do it eventually. I just can't implement it right now due to lack of time and several things that still need fixing in RosAsm before implementing SSE3/SSE4. I still need to find some time to detach RosAsm's internal code and create DLLs for the main tools, such as the disassembler, debugger, resources editor, forms creator and even the encoder. It would be better for all of these to live in their own DLLs rather than in a single source as they do now.


Macros in RosAsm work inside "[" and "]". The 1st bracket must be immediately followed by a separator "|". Like this:

[HIWORD | mov eax #1 | shr eax 16]

or the normal If Chain.

[If | cmp #1 #3 | jn#2 I1>]
[Else_if | jmp I9> | I1: | cmp #1 #3 | jn#2 I1>]
[Else | Jmp I9> | I1:]
[End_if | I1: | I9:]

Of course, this is not a rigid syntax. It's just the default macro set; users can choose whether to use it, or even write their own macro set.

Siekmanski

If you need a fast SSE2 horizontal addition for 4 packed floats,

    movaps  xmm1,xmm0
    shufps  xmm1,xmm0,10110001b
    addps   xmm0,xmm1
    movhlps xmm1,xmm0
    addss   xmm0,xmm1
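
To follow what happens in each lane, the five instructions can be modelled on a 4-element list (index 0 is the lowest float). This is only a Python sketch of the lane movements; the selection rules for shufps and movhlps are written out by hand and the test values are arbitrary.

```python
def shufps(dst, src, imm8):
    """SHUFPS: the two low result lanes are picked from dst, the two high ones from src."""
    return [dst[imm8 & 3], dst[(imm8 >> 2) & 3],
            src[(imm8 >> 4) & 3], src[(imm8 >> 6) & 3]]

def movhlps(dst, src):
    """MOVHLPS: the high half of src is copied to the low half of dst."""
    return [src[2], src[3], dst[2], dst[3]]

def horizontal_add(a):
    xmm0 = list(a)
    xmm1 = list(xmm0)                             # movaps  xmm1,xmm0
    xmm1 = shufps(xmm1, xmm0, 0b10110001)         # xmm1 = [a1, a0, a3, a2]
    xmm0 = [p + q for p, q in zip(xmm0, xmm1)]    # addps   xmm0,xmm1
    xmm1 = movhlps(xmm1, xmm0)                    # xmm1[0] = a2 + a3
    xmm0[0] = xmm0[0] + xmm1[0]                   # addss   xmm0,xmm1
    return xmm0[0]                                # the total sits in the lowest lane

print(horizontal_add([1.0, 2.0, 4.0, 8.0]))  # 15.0
```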