FPU power

Started by mabdelouahab, October 06, 2015, 02:19:01 AM

rrr314159

Print real8 variables with, for instance, "%.4g"; many other format options are available. real4 and real10 are not so easy. If you don't care about the extra precision of real10, the simplest approach is to stick with real8: just change all the 10's to 8's in dedndave's code.
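
For anyone following along from C, here is a minimal sketch of that formatting. The helper name format_real8 is made up for illustration and is not part of the Masm32 SDK or of dedndave's code; it only wraps the CRT's snprintf. With "%.4g" the CRT prints at most 4 significant digits and drops trailing zeros.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: format a REAL8 (a C double) with "%.4g".
   Returns the number of characters snprintf wanted to write. */
int format_real8(char *buf, size_t size, double value)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%.4g", value);
}
```

Called with 1.4142135623731 this leaves "1.414" in the buffer, and with 2.0 it leaves "2".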
I am NaN ;)

mabdelouahab

Thank you rrr314159, it works  :t

rrr314159

You're welcome; it's a privilege to help such a hard worker as yourself.

@Siekmanski et al,

It occurred to me that fyl2x and f2xm1 are similar to fsin: they use a power series to compute the function. And fsin is incredibly slow on Intel; it can be beaten by a factor of 100, as we found out in the Trigonometry thread in the Laboratory. I checked Agner, and it turns out these are also very slow! (On Prescott particularly; that probably applies to modern Intels too.) So it should be easy to go much faster. Essentially that would entail going back to the way we did it in the good old days, a roll-your-own power series; a PITA, but worth it if you need speed. If I get around to it I'll start a Laboratory thread on the topic. Of course at my age that might take a year, so anyone else is welcome to look into it.
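
A rough sketch of what such a roll-your-own series could look like in C, for the 2^f step only (the job f2xm1 does once the integer part has been split off). The name exp2_series and the choice of 13 Taylor terms are assumptions for illustration; a tuned polynomial would need fewer terms, and nothing here has been timed against the FPU instructions.

```c
#include <assert.h>

/* 2^f for -0.5 <= f <= 0.5, from the Taylor series of e^z with z = f*ln(2),
   evaluated in Horner form: 1 + z/1*(1 + z/2*(1 + ... (1 + z/13))). */
double exp2_series(double f)
{
    const double LN2 = 0.6931471805599453;  /* ln(2) */
    double z, sum = 1.0;
    int k;

    assert(f >= -0.5 && f <= 0.5);          /* range left after rounding to nearest */
    z = f * LN2;
    for (k = 13; k >= 1; --k)
        sum = 1.0 + sum * z / k;            /* one Horner step per series term */
    return sum;
}
```

The divisions are kept for readability; a speed-oriented version would presumably use a table of precomputed reciprocals or coefficients instead.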

dedndave

the REAL10 proc can be easily modified to REAL8
i use REAL10's because that's the "native" size for the FPU, internally

notice that, when you load REAL8's into the FPU, it converts them to REAL10's internally
hopefully, the result is in range when it saves the result to a REAL8 when done   :P

f8Exp PROTO :LPVOID,:LPVOID,:LPVOID

;***********************************************************************************************

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

f8Exp PROC lpResult:LPVOID,lpBase:LPVOID,lpExponent:LPVOID

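;0.0 raised to any exponent returns an indefinite qNaN
;any positive finite base raised to 0.0 returns 1.0 (a negative base returns a qNaN)

    mov     ecx,[esp+12]        ;ECX = lpExponent
    mov     edx,[esp+8]         ;EDX = lpBase
    mov     eax,[esp+4]         ;EAX = lpResult
    fld     real8 ptr [ecx]     ;ST0 = y (exponent)
    fld     real8 ptr [edx]     ;ST0 = x (base), ST1 = y
    fyl2x                       ;ST0 = y*log2(x)
    fld     st                  ;duplicate it
    frndint                     ;ST0 = integer part (current rounding mode)
    fsub    st(1),st            ;ST1 = fractional part
    fxch    st(1)               ;ST0 = fraction, ST1 = integer
    f2xm1                       ;ST0 = 2^fraction - 1
    fld1                        ;ST0 = 1.0
    fadd                        ;ST0 = 2^fraction (bare FADD assembles as FADDP ST(1),ST)
    fscale                      ;ST0 = 2^fraction * 2^integer
    fstp    st(1)               ;drop the integer part, result stays in ST0
    fstp    real8 ptr [eax]     ;store the result as REAL8 and pop
    ret     12                  ;callee removes the 3 DWORD arguments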

f8Exp ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

;***********************************************************************************************

Siekmanski

It would be a nice challenge to come up with a faster power function.
Wish I had more time to give it a try.....
Creative coders use backward thinking techniques as a strategy.

jj2007

Quote from: Siekmanski on October 06, 2015, 08:28:12 PM
It would be a nice challenge to come up with a faster power function.

Testbed attached :biggrin:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
24947   cycles for 100 * f8Exp (Dave)
12338   cycles for 100 * ExpXY (MasmBasic)
25097   cycles for 100 * pow (CRT)

24944   cycles for 100 * f8Exp (Dave)
12329   cycles for 100 * ExpXY (MasmBasic)
24983   cycles for 100 * pow (CRT)

24959   cycles for 100 * f8Exp (Dave)
12298   cycles for 100 * ExpXY (MasmBasic)
24983   cycles for 100 * pow (CRT)

24947   cycles for 100 * f8Exp (Dave)
12345   cycles for 100 * ExpXY (MasmBasic)
24990   cycles for 100 * pow (CRT)

24957   cycles for 100 * f8Exp (Dave)
12328   cycles for 100 * ExpXY (MasmBasic)
24972   cycles for 100 * pow (CRT)

1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)

TWell

AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)

20886   cycles for 100 * f8Exp (Dave, 2^0.5)
18506   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30764   cycles for 100 * pow (CRT, 4^0.5)

20947   cycles for 100 * f8Exp (Dave, 2^0.5)
18486   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
31182   cycles for 100 * pow (CRT, 4^0.5)

21028   cycles for 100 * f8Exp (Dave, 2^0.5)
18480   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30621   cycles for 100 * pow (CRT, 4^0.5)

21278   cycles for 100 * f8Exp (Dave, 2^0.5)
18486   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30769   cycles for 100 * pow (CRT, 4^0.5)

20891   cycles for 100 * f8Exp (Dave, 2^0.5)
18572   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30608   cycles for 100 * pow (CRT, 4^0.5)

1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)

gelatine1


Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)

25147   cycles for 100 * f8Exp (Dave, 2^0.5)
12421   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25152   cycles for 100 * pow (CRT, 4^0.5)

25197   cycles for 100 * f8Exp (Dave, 2^0.5)
12404   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25155   cycles for 100 * pow (CRT, 4^0.5)

25099   cycles for 100 * f8Exp (Dave, 2^0.5)
12393   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25162   cycles for 100 * pow (CRT, 4^0.5)

25092   cycles for 100 * f8Exp (Dave, 2^0.5)
12388   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25194   cycles for 100 * pow (CRT, 4^0.5)

25115   cycles for 100 * f8Exp (Dave, 2^0.5)
12417   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25116   cycles for 100 * pow (CRT, 4^0.5)

1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

25032   cycles for 100 * f8Exp (Dave, 2^0.5)
12318   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25215   cycles for 100 * pow (CRT, 4^0.5)

25034   cycles for 100 * f8Exp (Dave, 2^0.5)
12314   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25169   cycles for 100 * pow (CRT, 4^0.5)

25055   cycles for 100 * f8Exp (Dave, 2^0.5)
12332   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25148   cycles for 100 * pow (CRT, 4^0.5)

25066   cycles for 100 * f8Exp (Dave, 2^0.5)
12312   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25172   cycles for 100 * pow (CRT, 4^0.5)

25048   cycles for 100 * f8Exp (Dave, 2^0.5)
12302   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
25132   cycles for 100 * pow (CRT, 4^0.5)

1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)

--- ok ---

rrr314159

Thanks jj,

Obvious question, why is ExpXY faster? Aren't you (MasmBasic) using the same instructions: fyl2x and f2xm1; fprem or frndint; and fscale? If instead you're using a traditional algo (pre- fyl2x and f2xm1) that may explain it.

Another obvious one, why different test cases for the three "Units Under Test": square roots of 2, 3, and 4? Judging by the numbers it's not a problem. For some implementations 4^0.5 might go a lot faster, but obviously not with CRT pow. I guess we can suppose any inputs will go at the same speed.

Evidently ExpXY gets less advantage (tho still ahead) with AMD, which (judging by fsin) may have much better implementations of fyl2x and f2xm1. That would make sense if you're using a traditional algo which implements its own power series (for ln and exp)
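
To put a number on that, the ratio can be worked out from the first run of each testbed posted above. A small C sketch (the helper names are made up, they are not part of the testbed):

```c
#include <assert.h>

/* Average cycles per call from a "N cycles for reps * f" testbed line. */
double cycles_per_call(unsigned long total_cycles, unsigned reps)
{
    assert(reps > 0);
    return (double)total_cycles / reps;
}

/* How many times faster the second routine is than the first. */
double speedup(unsigned long cycles_slow, unsigned long cycles_fast)
{
    assert(cycles_fast > 0);
    return (double)cycles_slow / (double)cycles_fast;
}
```

With jj2007's i5-2450M figures, speedup(24947, 12338) comes out at about 2.02, while TWell's E-450 figures give speedup(20886, 18506) of about 1.13. On the i5, cycles_per_call(24947, 100) is roughly 249 cycles for one f8Exp call.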

AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

18776   cycles for 100 * f8Exp (Dave, 2^0.5)
17133   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22233   cycles for 100 * pow (CRT, 4^0.5)

18549   cycles for 100 * f8Exp (Dave, 2^0.5)
17595   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22059   cycles for 100 * pow (CRT, 4^0.5)

18576   cycles for 100 * f8Exp (Dave, 2^0.5)
17082   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22169   cycles for 100 * pow (CRT, 4^0.5)

18787   cycles for 100 * f8Exp (Dave, 2^0.5)
17142   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22370   cycles for 100 * pow (CRT, 4^0.5)

18554   cycles for 100 * f8Exp (Dave, 2^0.5)
17172   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
22242   cycles for 100 * pow (CRT, 4^0.5)

1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)

--- ok ---

jj2007

Quote from: rrr314159 on October 07, 2015, 12:31:15 AM
I guess we can suppose any inputs will go at the same speed.

That's an interesting hypothesis (and it was mine, too), but unfortunately it is strongly at odds with reality 8)

rrr314159

So ... wouldn't it be better to test them with the same inputs? Not that I want to make extra work for you. Only 4^0.5 ought to go faster, and since it's used with the slowest algo, the results should be good anyway

mabdelouahab

Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz (SSE4)

28576   cycles for 100 * f8Exp (Dave, 2^0.5)
13867   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
30481   cycles for 100 * pow (CRT, 4^0.5)

28425   cycles for 100 * f8Exp (Dave, 2^0.5)
13873   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27419   cycles for 100 * pow (CRT, 4^0.5)

28460   cycles for 100 * f8Exp (Dave, 2^0.5)
13872   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27390   cycles for 100 * pow (CRT, 4^0.5)

28419   cycles for 100 * f8Exp (Dave, 2^0.5)
13830   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27393   cycles for 100 * pow (CRT, 4^0.5)

28441   cycles for 100 * f8Exp (Dave, 2^0.5)
13849   cycles for 100 * ExpXY (MasmBasic, 3^0.5)
27396   cycles for 100 * pow (CRT, 4^0.5)

1.41421356237310 for f8Exp (Dave, 2^0.5)
1.73205080756888 for ExpXY (MasmBasic, 3^0.5)
2.00000000000000 for pow (CRT, 4^0.5)

--- ok ---

jj2007

Quote from: rrr314159 on October 07, 2015, 12:45:57 AM
So ... wouldn't it be better to test them with the same inputs? Not that I want to make extra work for you. Only 4^0.5 ought to go faster, and since it's used with the slowest algo, the results should be good anyway

Here is one with almost identical values. The little difference is to verify that the algo actually changed the variable...

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

14648   cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12378   cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24845   cycles for 100 * pow (CRT, 5.4323^0.5)

14645   cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12376   cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24840   cycles for 100 * pow (CRT, 5.4323^0.5)

14656   cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12378   cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24854   cycles for 100 * pow (CRT, 5.4323^0.5)

14656   cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12373   cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24837   cycles for 100 * pow (CRT, 5.4323^0.5)

14642   cycles for 100 * f8Exp (Dave, 5.4321^0.5)
12374   cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
24848   cycles for 100 * pow (CRT, 5.4323^0.5)

5.43210000000000  for f8Exp (Dave, 5.4321^0.5)^2
5.43220000000000  for ExpXY (MasmBasic, 5.4322^0.5)^2
5.43230000000000  for pow (CRT, 5.4323^0.5)^2

TWell

AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)

21041   cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23269   cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
32989   cycles for 100 * pow (CRT, 5.4323^0.5)

20927   cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23154   cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33051   cycles for 100 * pow (CRT, 5.4323^0.5)

20892   cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23173   cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33151   cycles for 100 * pow (CRT, 5.4323^0.5)

20886   cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23342   cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33086   cycles for 100 * pow (CRT, 5.4323^0.5)

21158   cycles for 100 * f8Exp (Dave, 5.4321^0.5)
23167   cycles for 100 * ExpXY (MasmBasic, 5.4322^0.5)
33050   cycles for 100 * pow (CRT, 5.4323^0.5)

5.43210000000000  for f8Exp (Dave, 5.4321^0.5)^2
5.43220000000000  for ExpXY (MasmBasic, 5.4322^0.5)^2
5.43230000000000  for pow (CRT, 5.4323^0.5)^2

--- ok ---