Floating point arithmetic question

Started by Lonewolff, April 16, 2018, 08:20:24 PM


Lonewolff

So I stumbled across something hey?  :lol:

For REAL4s, doing two fld's from memory is heaps faster all round. :t

Maybe it's a caching thing in the CPU itself? Perhaps it knows it already has the value there and just re-uses it?

RuiLoureiro

Quote from: jj2007 on April 25, 2018, 04:45:39 PM
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Cycles are so much slower than on the other machines?

No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
Quote
  mov ebx, 99   ; loop 100x   <<<< is this the "Real4 mem, mem" case?
  align 4
  .Repeat
    fld MyR4
    fld MyR4      ; <<<< second load from memory
    fstp st
    fstp st
    dec ebx
  .Until Sign?
-----------------------------
  mov ebx, 99   ; <<<< is this the "Real4 mem, st" case?
  align 4
  .Repeat
    fld MyR4
    fld st        ; <<<< second load from st
    fstp st
    fstp st
    dec ebx
  .Until Sign?
Good work, Jochen  :t
Is this what you are doing?
I should add that I have never used REAL4 or REAL8, only REAL10.

RuiLoureiro

Quote
Intel(R) Atom(TM) CPU N455   @ 1.66GHz (SSE4)

624     cycles for 100 * fld Real4 mem, mem
596     cycles for 100 * fld Real4 mem, st

603     cycles for 100 * fld Real8 mem, mem
595     cycles for 100 * fld Real8 mem, st

1426    cycles for 100 * fld Real10 mem, mem
1015    cycles for 100 * fld Real10 mem, st


604     cycles for 100 * fld Real4 mem, mem
595     cycles for 100 * fld Real4 mem, st

600     cycles for 100 * fld Real8 mem, mem
597     cycles for 100 * fld Real8 mem, st

1427    cycles for 100 * fld Real10 mem, mem
1002    cycles for 100 * fld Real10 mem, st


613     cycles for 100 * fld Real4 mem, mem
595     cycles for 100 * fld Real4 mem, st

615     cycles for 100 * fld Real8 mem, mem
599     cycles for 100 * fld Real8 mem, st

1416    cycles for 100 * fld Real10 mem, mem
1355    cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st




FORTRANS

F:\TEMP\TEST>fld_mem_
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

293     cycles for 100 * fld Real4 mem, mem
192     cycles for 100 * fld Real4 mem, st
298     cycles for 100 * fld Real8 mem, mem
186     cycles for 100 * fld Real8 mem, st
508     cycles for 100 * fld Real10 mem, mem
423     cycles for 100 * fld Real10 mem, st

290     cycles for 100 * fld Real4 mem, mem
193     cycles for 100 * fld Real4 mem, st
292     cycles for 100 * fld Real8 mem, mem
192     cycles for 100 * fld Real8 mem, st
508     cycles for 100 * fld Real10 mem, mem
423     cycles for 100 * fld Real10 mem, st

289     cycles for 100 * fld Real4 mem, mem
198     cycles for 100 * fld Real4 mem, st
296     cycles for 100 * fld Real8 mem, mem
193     cycles for 100 * fld Real8 mem, st
512     cycles for 100 * fld Real10 mem, mem
425     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---

jj2007

Interesting results ::)
- an AMD that does impossible things
- recent Intel i5/i7 where mem access is faster
- mobile cpus like Celeron and Pentium M where fld st is faster.

What does that teach us? Nothing 8)

Siekmanski

Making fast routines isn't easy in modern times.  :(
We need a database for optimizations per system.  :bgrin:
Creative coders use backward thinking techniques as a strategy.

hutch--

> What does that teach us?

Balanced algorithms with testing spread across mixed hardware.

daydreamer

Quote from: Siekmanski on April 26, 2018, 12:34:23 AM
Making fast routines isn't easy in modern times.  :(
We need a database for optimizations per system.  :bgrin:
One could write similar code in MASM, along the lines of Java's Just In Time (JIT) compiler:
first run and time the first code snippet, then the second version of the snippet, and so on. Finally, compare which one runs fastest, keep a jump to that version only, and set a flag that signals that the testing is over.
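
The self-timing idea above could be sketched roughly as follows. This is only a minimal sketch under assumptions, not tested code from this thread: the proc names (VariantMemMem, VariantMemSt, TimeIt, SelectFastest) and the variables pBest and tested are made up for illustration, the include path assumes a default MASM32 SDK install, and a single rdtsc measurement is far noisier than the averaged timings posted here.

```asm
include \masm32\include\masm32rt.inc   ; MASM32 runtime: model, libs, macros
.686                                    ; rdtsc needs .586 or later

.data
MyR4    REAL4 1.5       ; placeholder test value
pBest   dd 0            ; address of the winning variant (hypothetical name)
tested  dd 0            ; flag: 1 once the selection has been made

.code
VariantMemMem proc      ; variant A: fld mem, fld mem
    mov ecx, 99         ; loop 100x
@@: fld MyR4
    fld MyR4
    fstp st
    fstp st
    dec ecx
    jns @B
    ret
VariantMemMem endp

VariantMemSt proc       ; variant B: fld mem, fld st
    mov ecx, 99         ; loop 100x
@@: fld MyR4
    fld st
    fstp st
    fstp st
    dec ecx
    jns @B
    ret
VariantMemSt endp

TimeIt proc uses esi pCode:DWORD   ; returns elapsed ticks (low dword) in eax
    call pCode          ; warm-up run, not timed
    rdtsc
    mov esi, eax        ; start count
    call pCode          ; timed run
    rdtsc
    sub eax, esi        ; eax = end - start
    ret
TimeIt endp

SelectFastest proc uses ebx
    invoke TimeIt, offset VariantMemMem
    mov ebx, eax                        ; ticks for variant A
    invoke TimeIt, offset VariantMemSt  ; eax = ticks for variant B
    mov pBest, offset VariantMemMem     ; assume A wins ...
    .if eax < ebx                       ; ... unless B needed fewer ticks
        mov pBest, offset VariantMemSt
    .endif
    mov tested, 1                       ; testing is over
    ret
SelectFastest endp

start:
    .if tested == 0
        call SelectFastest
    .endif
    call pBest          ; from here on, only the winner gets called
    print "selection done", 13, 10
    exit
end start
```

After the first call, every later use goes through pBest, so the cost of the test is paid only once per run. The result is only valid for the machine the program is currently running on.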

Siekmanski

Creative coders use backward thinking techniques as a strategy.

RuiLoureiro

Quote from: Siekmanski on April 26, 2018, 07:06:26 AM
Nah, I prefer Hutch's approach.
Me too  :t
On your i7-4930K the results are, more or less, 373 cycles for real10, in both cases...

RuiLoureiro

Here is another one:
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

193     cycles for 100 * fld Real4 mem, mem
174     cycles for 100 * fld Real4 mem, st

178     cycles for 100 * fld Real8 mem, mem
162     cycles for 100 * fld Real8 mem, st

1644    cycles for 100 * fld Real10 mem, mem
838     cycles for 100 * fld Real10 mem, st
------------------------

177     cycles for 100 * fld Real4 mem, mem
158     cycles for 100 * fld Real4 mem, st

177     cycles for 100 * fld Real8 mem, mem
174     cycles for 100 * fld Real8 mem, st

1644    cycles for 100 * fld Real10 mem, mem
847     cycles for 100 * fld Real10 mem, st
------------------------

193     cycles for 100 * fld Real4 mem, mem
175     cycles for 100 * fld Real4 mem, st

179     cycles for 100 * fld Real8 mem, mem
159     cycles for 100 * fld Real8 mem, st

1649    cycles for 100 * fld Real10 mem, mem
838     cycles for 100 * fld Real10 mem, st
------------------------

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---

FORTRANS

Hi Jochen,


Quote from: jj2007 on April 25, 2018, 11:49:08 PM
Interesting results ::)
- an AMD that does impossible things
- recent Intel i5/i7 where mem access is faster
- mobile cpus like Celeron and Pentium M where fld st is faster.

What does that teach us? Nothing 8)

Quote from: jj2007 on April 25, 2018, 04:45:39 PM
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Cycles are so much slower than on the other machines?

No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
  mov ebx, 99 ; loop 100x
  align 4
  .Repeat
    fld MyR4
    fld MyR4
    fstp st
    fstp st
    dec ebx
  .Until Sign?


   An idea occurred to me.*  Could the FPU be caching MyR4 when
it is loaded twice?  Would making a 200 element array, and loading a
different value {say MyR4[X],MyR4[X+400]} change the timing
at all?  Sorry if this is a silly question, the AMD results just looked
too odd.

Regards,

Steve N.

* Yes, it does happen.  Of course, that doesn't imply it was a good one.
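
For reference, the two-element load suggested above might look something like this. It is only a sketch: the 200 element REAL4 array is an assumption taken from the description, the +400 is a byte offset (100 elements further on), and the array would need to be filled with sensible values before timing.

```asm
.data?
MyR4  REAL4 200 dup(?)      ; assumed 200 element array, to be filled before timing

.code
  mov ebx, 99               ; loop 100x, X = 99..0
  align 4
  .Repeat
    fld MyR4[4*ebx]         ; element X
    fld MyR4[4*ebx+400]     ; element X+100, i.e. 400 bytes further on
    fstp st
    fstp st
    dec ebx
  .Until Sign?
```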

HSE

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

11      cycles for 100 * fld Real4 mem, mem
7       cycles for 100 * fld Real4 mem, st
50      cycles for 100 * fld Real8 mem, mem
9       cycles for 100 * fld Real8 mem, st
600     cycles for 100 * fld Real10 mem, mem
309     cycles for 100 * fld Real10 mem, st

10      cycles for 100 * fld Real4 mem, mem
7       cycles for 100 * fld Real4 mem, st
50      cycles for 100 * fld Real8 mem, mem
7       cycles for 100 * fld Real8 mem, st
602     cycles for 100 * fld Real10 mem, mem
309     cycles for 100 * fld Real10 mem, st

9       cycles for 100 * fld Real4 mem, mem
7       cycles for 100 * fld Real4 mem, st
52      cycles for 100 * fld Real8 mem, mem
7       cycles for 100 * fld Real8 mem, st
600     cycles for 100 * fld Real10 mem, mem
309     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---


The run above uses two variables per type: myR4 and myR4b, myR8 and myR8b, and myR10 and myR10b.

Note that R8 is slower now, but R4 is no different.
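
With two variables, the Real4 "mem, mem" loop would presumably look like this. It is a sketch only: the values are placeholders, and the spelling MyR4/MyR4b follows the test code quoted earlier in the thread rather than the actual source used for this run.

```asm
.data
MyR4   REAL4 1.0            ; placeholder values, not taken from the test
MyR4b  REAL4 2.0

.code
  mov ebx, 99               ; loop 100x
  align 4
  .Repeat
    fld MyR4                ; first load
    fld MyR4b               ; second load, from a different variable
    fstp st
    fstp st
    dec ebx
  .Until Sign?
```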
Equations in Assembly: SmplMath

jj2007

Quote from: FORTRANS on April 26, 2018, 11:05:12 PM
An idea occurred to me.*  Loading MyR4 twice may be cached
by the FPU?  Would making a 200 element array, and loading a
different value {say MyR4[X],MyR4[X+400]} change the timing
any?  Sorry if this is a silly question, the AMD results just looked
too odd.

Steve,
That is definitely not a silly question, so I tested it with randomly filled arrays:
  mov ebx, AlgoLoops-1
  align 4
  .Repeat
    fld MyR4[4*ebx]
    fld MyR4[4*ebx]
    fstp st
    fstp st
    dec ebx
  .Until Sign?

Results with AlgoLoops=10,000:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

20248   cycles for 10000 * fld Real4 mem, mem
30267   cycles for 10000 * fld Real4 mem, st
20704   cycles for 10000 * fld Real8 mem, mem
30563   cycles for 10000 * fld Real8 mem, st
40031   cycles for 10000 * fld Real10 mem, mem
40005   cycles for 10000 * fld Real10 mem, st

20247   cycles for 10000 * fld Real4 mem, mem
30314   cycles for 10000 * fld Real4 mem, st
20708   cycles for 10000 * fld Real8 mem, mem
30512   cycles for 10000 * fld Real8 mem, st
40018   cycles for 10000 * fld Real10 mem, mem
40026   cycles for 10000 * fld Real10 mem, st

20248   cycles for 10000 * fld Real4 mem, mem
30272   cycles for 10000 * fld Real4 mem, st
20755   cycles for 10000 * fld Real8 mem, mem
30492   cycles for 10000 * fld Real8 mem, st
40013   cycles for 10000 * fld Real10 mem, mem
40015   cycles for 10000 * fld Real10 mem, st


In short: My i5 couldn't care less :P

Now the Celeron:
Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz (SSE4)
16354   cycles for 10000 * fld Real4 mem, mem
15524   cycles for 10000 * fld Real4 mem, st
18059   cycles for 10000 * fld Real8 mem, mem
17376   cycles for 10000 * fld Real8 mem, st
98249   cycles for 10000 * fld Real10 mem, mem
57020   cycles for 10000 * fld Real10 mem, st

16382   cycles for 10000 * fld Real4 mem, mem
17095   cycles for 10000 * fld Real4 mem, st
18286   cycles for 10000 * fld Real8 mem, mem
17329   cycles for 10000 * fld Real8 mem, st
99281   cycles for 10000 * fld Real10 mem, mem
56412   cycles for 10000 * fld Real10 mem, st

15761   cycles for 10000 * fld Real4 mem, mem
16297   cycles for 10000 * fld Real4 mem, st
17914   cycles for 10000 * fld Real8 mem, mem
18403   cycles for 10000 * fld Real8 mem, st
98288   cycles for 10000 * fld Real10 mem, mem
56802   cycles for 10000 * fld Real10 mem, st

HSE


AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

1811    cycles for 10000 * fld Real4 mem, mem
1619    cycles for 10000 * fld Real4 mem, st
3760    cycles for 10000 * fld Real8 mem, mem
??      cycles for 10000 * fld Real8 mem, st
55645   cycles for 10000 * fld Real10 mem, mem
32089   cycles for 10000 * fld Real10 mem, st

1313    cycles for 10000 * fld Real4 mem, mem
??      cycles for 10000 * fld Real4 mem, st
5816    cycles for 10000 * fld Real8 mem, mem
1112    cycles for 10000 * fld Real8 mem, st
61808   cycles for 10000 * fld Real10 mem, mem
31172   cycles for 10000 * fld Real10 mem, st

??      cycles for 10000 * fld Real4 mem, mem
??      cycles for 10000 * fld Real4 mem, st
3241    cycles for 10000 * fld Real8 mem, mem
??      cycles for 10000 * fld Real8 mem, st
61940   cycles for 10000 * fld Real10 mem, mem
27024   cycles for 10000 * fld Real10 mem, st

18      bytes for fld Real4 mem, mem
13      bytes for fld Real4 mem, st
18      bytes for fld Real8 mem, mem
13      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---


8)
now with array in .data?
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

289     cycles for 10000 * fld Real4 mem, mem
1526    cycles for 10000 * fld Real4 mem, st
1665    cycles for 10000 * fld Real8 mem, mem
2670    cycles for 10000 * fld Real8 mem, st
51864   cycles for 10000 * fld Real10 mem, mem
25661   cycles for 10000 * fld Real10 mem, st

270     cycles for 10000 * fld Real4 mem, mem
1552    cycles for 10000 * fld Real4 mem, st
726     cycles for 10000 * fld Real8 mem, mem
2681    cycles for 10000 * fld Real8 mem, st
61508   cycles for 10000 * fld Real10 mem, mem
25214   cycles for 10000 * fld Real10 mem, st

1824    cycles for 10000 * fld Real4 mem, mem
??      cycles for 10000 * fld Real4 mem, st
177     cycles for 10000 * fld Real8 mem, mem
3478    cycles for 10000 * fld Real8 mem, st
51708   cycles for 10000 * fld Real10 mem, mem
31514   cycles for 10000 * fld Real10 mem, st

18      bytes for fld Real4 mem, mem
13      bytes for fld Real4 mem, st
18      bytes for fld Real8 mem, mem
13      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---


So fld RealXX mem, mem is faster than fld RealXX mem, st ??