Floating point arithmetic question

Started by Lonewolff, April 16, 2018, 08:20:24 PM

jj2007

Quote from: RuiLoureiro on April 27, 2018, 12:04:37 AM
?? cycles ... ? What does it mean?

It means the timing procedure could not establish a valid value. There is something strange with the AMD:
289     cycles for 10000 * fld Real4 mem, mem

This is simply impossible. My template uses cpuid + rdtsc from Michael Webster's timer macros; no idea what could happen there:
counter_begin TimerLoops, HIGH_PRIORITY_CLASS
call TestA
counter_end
ShowCycles TestA
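
A quick sanity check on why that figure looks impossible: divide the reported total by the iteration count. This is only a sketch on the numbers quoted in this post, assuming each of the 10000 iterations contains at least one fld; the helper name is made up for illustration.

```python
# Sanity check: average cycles per loop iteration for a reported total.
# The totals are figures quoted in this thread; nothing is measured here.

ITERATIONS = 10000

def cycles_per_iteration(total_cycles, iterations=ITERATIONS):
    """Average cycles spent per loop iteration (hypothetical helper)."""
    return total_cycles / iterations

amd_real4 = cycles_per_iteration(289)      # the suspicious AMD figure
intel_real4 = cycles_per_iteration(20191)  # i5-2450M, same test

print(f"AMD:   {amd_real4:.4f} cycles per iteration")
print(f"Intel: {intel_real4:.4f} cycles per iteration")
```

A value far below one cycle per iteration would mean dozens of iterations retiring every cycle, which points at the measurement rather than at the CPU.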


Could it be that the loop overhead is wrong, or that the spinup loop is too short? Attached is a special version for HSE that shows the overhead and, once, uses a much longer spinup loop (20x):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
loop overhead is approx. 10067/10000 cycles

20191   cycles for 10000 * fld Real4 mem, mem
30011   cycles for 10000 * fld Real4 mem, st
20753   cycles for 10000 * fld Real8 mem, mem
30031   cycles for 10000 * fld Real8 mem, st
39976   cycles for 10000 * fld Real10 mem, mem
40001   cycles for 10000 * fld Real10 mem, st
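
For anyone who wants to compare the pasted reports without retyping them, here is a minimal sketch that parses the output format above into per-iteration numbers. The regular expressions are my own guess at the layout based on the lines shown here, not part of the test program.

```python
import re

# Patterns inferred from the report lines posted in this thread (assumption).
RESULT_RE = re.compile(r"^\s*(\d+)\s+cycles for (\d+) \* (.+?)\s*$")
OVERHEAD_RE = re.compile(r"loop overhead is approx\. (\d+)/(\d+) cycles")

def parse_report(text):
    """Return (overhead cycles per iteration, [(label, cycles per iteration)])."""
    overhead = None
    results = []
    for line in text.splitlines():
        m = OVERHEAD_RE.search(line)
        if m:
            overhead = int(m.group(1)) / int(m.group(2))
            continue
        m = RESULT_RE.match(line)
        if m:
            cycles, iterations, label = int(m.group(1)), int(m.group(2)), m.group(3)
            results.append((label, cycles / iterations))
    return overhead, results

SAMPLE = """\
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
loop overhead is approx. 10067/10000 cycles

20191   cycles for 10000 * fld Real4 mem, mem
30011   cycles for 10000 * fld Real4 mem, st
20753   cycles for 10000 * fld Real8 mem, mem
30031   cycles for 10000 * fld Real8 mem, st
39976   cycles for 10000 * fld Real10 mem, mem
40001   cycles for 10000 * fld Real10 mem, st
"""

overhead, results = parse_report(SAMPLE)
print(f"overhead: {overhead:.4f} cycles per iteration")
for label, per_iteration in results:
    print(f"{label:22} {per_iteration:.4f}")
```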

FORTRANS

Hi,

   Thanks for testing.

Quote from: jj2007 on April 26, 2018, 11:34:14 PM

Steve,
That is definitely not a silly question, so I tested it with randomly filled arrays:

fld MyR4[4*ebx]
fld MyR4[4*ebx]


   But again you load the same value twice in a row.  That's why I
put in a +400 for the second load.  But HSE's results show that
probably would not change things on his AMD.

Thanks,

Steve

jimg

AMD Phenom(tm) II X6 1045T Processor (SSE3)
Spinup done
loop overhead is approx. 20057/10000 cycles

879     cycles for 10000 * fld Real4 mem, mem
6518    cycles for 10000 * fld Real4 mem, st
3628    cycles for 10000 * fld Real8 mem, mem
4023    cycles for 10000 * fld Real8 mem, st
60146   cycles for 10000 * fld Real10 mem, mem
47240   cycles for 10000 * fld Real10 mem, st

841     cycles for 10000 * fld Real4 mem, mem
1091    cycles for 10000 * fld Real4 mem, st
3911    cycles for 10000 * fld Real8 mem, mem
5221    cycles for 10000 * fld Real8 mem, st
60101   cycles for 10000 * fld Real10 mem, mem
43169   cycles for 10000 * fld Real10 mem, st

695     cycles for 10000 * fld Real4 mem, mem
870     cycles for 10000 * fld Real4 mem, st
4738    cycles for 10000 * fld Real8 mem, mem
4917    cycles for 10000 * fld Real8 mem, st
60153   cycles for 10000 * fld Real10 mem, mem
35096   cycles for 10000 * fld Real10 mem, st

18      bytes for fld Real4 mem, mem
13      bytes for fld Real4 mem, st
18      bytes for fld Real8 mem, mem
13      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


-

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
Spinup done
loop overhead is approx. 10454/10000 cycles

20329   cycles for 10000 * fld Real4 mem, mem
29593   cycles for 10000 * fld Real4 mem, st
21202   cycles for 10000 * fld Real8 mem, mem
29582   cycles for 10000 * fld Real8 mem, st
39593   cycles for 10000 * fld Real10 mem, mem
39595   cycles for 10000 * fld Real10 mem, st

19664   cycles for 10000 * fld Real4 mem, mem
29596   cycles for 10000 * fld Real4 mem, st
19650   cycles for 10000 * fld Real8 mem, mem
29598   cycles for 10000 * fld Real8 mem, st
39594   cycles for 10000 * fld Real10 mem, mem
39597   cycles for 10000 * fld Real10 mem, st

19614   cycles for 10000 * fld Real4 mem, mem
29630   cycles for 10000 * fld Real4 mem, st
19701   cycles for 10000 * fld Real8 mem, mem
29659   cycles for 10000 * fld Real8 mem, st
39591   cycles for 10000 * fld Real10 mem, mem
39590   cycles for 10000 * fld Real10 mem, st

18      bytes for fld Real4 mem, mem
13      bytes for fld Real4 mem, st
18      bytes for fld Real8 mem, mem
13      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---
Creative coders use backward thinking techniques as a strategy.

jj2007

Quote from: FORTRANS on April 27, 2018, 12:44:51 AM
But again you load the same value twice in a row.

OK, since you insist :P
fld MyR4[4*ebx]
fld MyR4[4*ebx+4*2000]
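
To spell out the address arithmetic of the two operands above: the second load reads the element 2000 places further on, i.e. 8000 bytes away. A small sketch, assuming MyR4 is an array of 4-byte REAL4 values with at least ebx+2001 elements, and assuming a 64-byte cache line (both are assumptions, as are the helper names):

```python
REAL4_SIZE = 4   # bytes per REAL4 element
CACHE_LINE = 64  # assumed cache line size in bytes

def operand_offsets(ebx, element_delta=2000):
    """Byte offsets from MyR4 for MyR4[4*ebx] and MyR4[4*ebx+4*2000]."""
    first = REAL4_SIZE * ebx
    second = REAL4_SIZE * ebx + REAL4_SIZE * element_delta
    return first, second

def certainly_different_lines(first, second, line=CACHE_LINE):
    # Two addresses at least one full line apart can never share a line,
    # whatever the alignment of the array.
    return abs(second - first) >= line

first, second = operand_offsets(10)
print(first, second, second - first)
print(certainly_different_lines(first, second))
```

So with this variant the two loads never hit the same value, and under the 64-byte assumption never the same cache line either.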


Note that Jim's AMD needs twice as much time for the naked loop; it could be that the AMD executes the loop and the FPU loads simultaneously. Agner says the latest Ryzen is faster:
Quote
This makes it possible to execute a tiny loop with up to six instructions in one clock cycle per iteration

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Spinup done, 4,200,000,000*dec eax
loop overhead is approx. 10112/10000 cycles

20271   cycles for 10000 * fld Real4 mem, mem
30139   cycles for 10000 * fld Real4 mem, st
20586   cycles for 10000 * fld Real8 mem, mem
30194   cycles for 10000 * fld Real8 mem, st
42013   cycles for 10000 * fld Real10 mem, mem
40749   cycles for 10000 * fld Real10 mem, st

20330   cycles for 10000 * fld Real4 mem, mem
30232   cycles for 10000 * fld Real4 mem, st
20648   cycles for 10000 * fld Real8 mem, mem
30324   cycles for 10000 * fld Real8 mem, st
41906   cycles for 10000 * fld Real10 mem, mem
40546   cycles for 10000 * fld Real10 mem, st

20365   cycles for 10000 * fld Real4 mem, mem
30313   cycles for 10000 * fld Real4 mem, st
20606   cycles for 10000 * fld Real8 mem, mem
30312   cycles for 10000 * fld Real8 mem, st
42380   cycles for 10000 * fld Real10 mem, mem
41474   cycles for 10000 * fld Real10 mem, st

FORTRANS

Hi Jochen,

   Thank you again.  Now my results have changed.  So I am
confused once more.

F:\TEMP\TEST>FLD_MEM_.exe
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
Spinup done, 4,200,000,000*dec eax in 2695 ms
loop overhead is approx. 10082/10000 cycles

22465   cycles for 10000 * fld Real4 mem, mem
25357   cycles for 10000 * fld Real4 mem, st
27534   cycles for 10000 * fld Real8 mem, mem
28920   cycles for 10000 * fld Real8 mem, st
75225   cycles for 10000 * fld Real10 mem, mem
52567   cycles for 10000 * fld Real10 mem, st

22216   cycles for 10000 * fld Real4 mem, mem
26432   cycles for 10000 * fld Real4 mem, st
25397   cycles for 10000 * fld Real8 mem, mem
27759   cycles for 10000 * fld Real8 mem, st
70948   cycles for 10000 * fld Real10 mem, mem
51487   cycles for 10000 * fld Real10 mem, st

20858   cycles for 10000 * fld Real4 mem, mem
25416   cycles for 10000 * fld Real4 mem, st
24065   cycles for 10000 * fld Real8 mem, mem
28251   cycles for 10000 * fld Real8 mem, st
70658   cycles for 10000 * fld Real10 mem, mem
51233   cycles for 10000 * fld Real10 mem, st


--- ok ---


   At least the Real10 results did not swap places.

Regards,

Steve N.
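
One way to see which of the Pentium M results above are stable: take the median of the three runs per test and check which variant wins for each type. This is only a sketch on the posted numbers; the helper names are hypothetical.

```python
from statistics import median

# Three runs per test, copied from the Pentium M output above.
RUNS = {
    "Real4 mem, mem":  [22465, 22216, 20858],
    "Real4 mem, st":   [25357, 26432, 25416],
    "Real8 mem, mem":  [27534, 25397, 24065],
    "Real8 mem, st":   [28920, 27759, 28251],
    "Real10 mem, mem": [75225, 70948, 70658],
    "Real10 mem, st":  [52567, 51487, 51233],
}

def medians(runs):
    """Median cycle count per test across the runs."""
    return {test: median(values) for test, values in runs.items()}

def faster_variant(meds, real_type):
    """Which of the two variants has the lower median for this type."""
    mem = meds[f"{real_type} mem, mem"]
    st = meds[f"{real_type} mem, st"]
    return "mem, mem" if mem < st else "mem, st"

meds = medians(RUNS)
for real_type in ("Real4", "Real8", "Real10"):
    print(real_type, "faster variant:", faster_variant(meds, real_type))
```

On these figures mem, mem is ahead for Real4 and Real8, while mem, st is clearly ahead for Real10 in all three runs.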

RuiLoureiro

Results for the last version
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Spinup done, 4,200,000,000*dec eax in 2041 ms
loop overhead is approx. 18160/10000 cycles

33459   cycles for 10000 * fld Real4 mem, mem
37247   cycles for 10000 * fld Real4 mem, st
49295   cycles for 10000 * fld Real8 mem, mem
45967   cycles for 10000 * fld Real8 mem, st
186538  cycles for 10000 * fld Real10 mem, mem
108538  cycles for 10000 * fld Real10 mem, st

31169   cycles for 10000 * fld Real4 mem, mem
37200   cycles for 10000 * fld Real4 mem, st
41419   cycles for 10000 * fld Real8 mem, mem
46164   cycles for 10000 * fld Real8 mem, st
184284  cycles for 10000 * fld Real10 mem, mem
108466  cycles for 10000 * fld Real10 mem, st

31042   cycles for 10000 * fld Real4 mem, mem
37448   cycles for 10000 * fld Real4 mem, st
42335   cycles for 10000 * fld Real8 mem, mem
46058   cycles for 10000 * fld Real8 mem, st
184647  cycles for 10000 * fld Real10 mem, mem
110142  cycles for 10000 * fld Real10 mem, st


--- ok ---

HSE

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
Spinup done, 4,200,000,000*dec eax in 3757 ms
loop overhead is approx. 19715/10000 cycles

10047   cycles for 10000 * fld Real4 mem, mem
7442    cycles for 10000 * fld Real4 mem, st
13013   cycles for 10000 * fld Real8 mem, mem
9412    cycles for 10000 * fld Real8 mem, st
65984   cycles for 10000 * fld Real10 mem, mem
27897   cycles for 10000 * fld Real10 mem, st

7320    cycles for 10000 * fld Real4 mem, mem
11115   cycles for 10000 * fld Real4 mem, st
12497   cycles for 10000 * fld Real8 mem, mem
12401   cycles for 10000 * fld Real8 mem, st
63567   cycles for 10000 * fld Real10 mem, mem
33683   cycles for 10000 * fld Real10 mem, st

7254    cycles for 10000 * fld Real4 mem, mem
10440   cycles for 10000 * fld Real4 mem, st
12343   cycles for 10000 * fld Real8 mem, mem
7372    cycles for 10000 * fld Real8 mem, st
63265   cycles for 10000 * fld Real10 mem, mem
30161   cycles for 10000 * fld Real10 mem, st


--- ok ---
Equations in Assembly: SmplMath

Lonewolff

What are you guys using to run this?

jj2007

The exe attached to Reply #65.

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
Spinup done, 4,200,000,000*dec eax in 3757 ms
loop overhead is approx. 19715/10000 cycles


So the AMD takes a long time to get to full speed, and has a high loop overhead. Hm :(
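
To put the overhead and spinup figures posted so far side by side, a small sketch that tabulates them. All numbers are copied from the reports in this thread; where a report showed more than one overhead value, the one from the latest version of the test is used. The helper names are my own.

```python
# Figures copied from the reports posted in this thread:
# (loop overhead in cycles per 10000 iterations, spinup time in ms or None)
REPORTS = {
    "Intel Core i5-2450M":     (10112, None),
    "AMD Phenom II X6 1045T":  (20057, None),
    "Intel Core i7-4930K":     (10454, None),
    "Intel Pentium M 1.70GHz": (10082, 2695),
    "Intel Pentium 4 3.40GHz": (18160, 2041),
    "AMD A6-3500 APU":         (19715, 3757),
}
ITERATIONS = 10000

def overhead_per_iteration(cycles):
    """Loop overhead expressed as cycles per iteration."""
    return cycles / ITERATIONS

def by_overhead(reports):
    """CPU names sorted from lowest to highest loop overhead."""
    return sorted(reports, key=lambda cpu: reports[cpu][0])

for cpu in by_overhead(REPORTS):
    cycles, spinup_ms = REPORTS[cpu]
    spinup = f"{spinup_ms} ms" if spinup_ms is not None else "n/a"
    print(f"{cpu:26} {overhead_per_iteration(cycles):.4f} cycles/iteration, spinup {spinup}")
```

On these numbers the Pentium 4 also sits near two cycles per iteration, so the high loop overhead is not limited to the AMDs, although the A6-3500 does have the longest spinup of the three machines that reported one.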

Lonewolff

Cool  :t


16575   cycles for 10000 * fld Real4 mem, mem
18297   cycles for 10000 * fld Real4 mem, st
26425   cycles for 10000 * fld Real8 mem, mem
19634   cycles for 10000 * fld Real8 mem, st
79024   cycles for 10000 * fld Real10 mem, mem
42922   cycles for 10000 * fld Real10 mem, st

16270   cycles for 10000 * fld Real4 mem, mem
16712   cycles for 10000 * fld Real4 mem, st
24982   cycles for 10000 * fld Real8 mem, mem
17107   cycles for 10000 * fld Real8 mem, st
77920   cycles for 10000 * fld Real10 mem, mem
44588   cycles for 10000 * fld Real10 mem, st

17735   cycles for 10000 * fld Real4 mem, mem
16897   cycles for 10000 * fld Real4 mem, st
23506   cycles for 10000 * fld Real8 mem, mem
23064   cycles for 10000 * fld Real8 mem, st
79142   cycles for 10000 * fld Real10 mem, mem
41752   cycles for 10000 * fld Real10 mem, st

HSE

Sorry Lone, I was asking JJ about something else.

Pipeline is the nickname of a feature of RISC processors. They process one instruction, fetch the next instruction from memory and read the following instruction in the same cycle (or something like that  :biggrin:).
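
A toy model of that overlap, assuming an idealized three-stage pipeline with no stalls. The stage count and the function name are illustrative assumptions, not a description of any particular CPU.

```python
def cycles_needed(instructions, stages=3, pipelined=True):
    """Cycles to finish a run of instructions in an idealized model."""
    if instructions <= 0:
        return 0
    if pipelined:
        # The first instruction needs `stages` cycles to pass through;
        # every further one completes one cycle after its predecessor.
        return stages + instructions - 1
    # Without overlap each instruction occupies all stages by itself.
    return stages * instructions

print(cycles_needed(10000))                   # with overlap
print(cycles_needed(10000, pipelined=False))  # without overlap
```

For 10000 instructions the model gives 10002 cycles with overlap versus 30000 without, which is why a loop body can cost close to one cycle per instruction even though each instruction takes several cycles from start to finish.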