The MASM Forum

General => The Campus => Topic started by: Lonewolff on April 16, 2018, 08:20:24 PM

Title: Floating point arithmetic question
Post by: Lonewolff on April 16, 2018, 08:20:24 PM
Hi guys,

I am trying to work out how to do a simple floating point subtraction but the result is incorrect.


fild valA
fild valB
fsub
fstp result


valA is a real4 of 1000.0 and valB is a real4 of 1.0

I was expecting a result of 999.0 but I get -8.34929E+07.

Any advice would be much appreciated  8)
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 16, 2018, 08:32:37 PM
Ah! Worked it out. I should have been using fld not fild  :eusa_clap:
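
For reference, here is a minimal sketch of the corrected sequence. The declarations and the surrounding program skeleton (the start label and the bare ret at the end) are assumptions for illustration, not the original source. fild reads the four bytes of a REAL4 as a signed integer, which is where the huge result came from; fld reads them as a floating-point value.

```asm
.686
.model flat, stdcall
option casemap:none

.data
valA    REAL4 1000.0       ; bit pattern 447A0000h
valB    REAL4 1.0          ; bit pattern 3F800000h
result  REAL4 ?

.code
start:
    ; fild would read the same four bytes as signed integers:
    ;   447A0000h = 1148846080 and 3F800000h = 1065353216,
    ;   which differ by 83492864 (about 8.34929E+07 in magnitude).
    fld   valA             ; st(0) = 1000.0
    fsub  valB             ; st(0) = 1000.0 - 1.0 = 999.0
    fstp  result           ; store 999.0 and pop; the FPU stack is empty again
    ret                    ; assumed minimal exit from the entry point
end start
```

Using fsub with a memory operand avoids the second load, so only one FPU register is ever occupied.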
Title: Re: Floating point arithmetic question
Post by: daydreamer on April 16, 2018, 08:58:39 PM
For simple math, sqrt and rsqrt, use SSE instead; it's easier to make code faster in a loop
movss xmm0,val1
subss xmm0,val2
movss result,xmm0
And when you need a speedup, use movaps, subps etc. instead
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 16, 2018, 09:05:22 PM
Awesome! Thanks for the tip  8)
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 16, 2018, 09:22:25 PM
Quote from: daydreamer on April 16, 2018, 08:58:39 PM
For simple math, sqrt and rsqrt, use SSE instead; it's easier to make code faster in a loop
movss xmm0,val1
subss xmm0,val2
movss result,xmm0

And if you are not absolutely sure, time it:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

87      cycles for 100 * movss
59      cycles for 100 * fpu

80      cycles for 100 * movss
59      cycles for 100 * fpu

80      cycles for 100 * movss
59      cycles for 100 * fpu

84      cycles for 100 * movss
59      cycles for 100 * fpu

80      cycles for 100 * movss
60      cycles for 100 * fpu

24      bytes for movss
18      bytes for fpu


movss:
movss xmm0, val1
subss xmm0, val2
movss result, xmm0


fpu:
fld val1
fsub val2
fstp result
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 16, 2018, 09:33:46 PM
Yep. Been timing routines heaps today.

Trying to make them as tight as possible as they are performance critical for my project.

Lots of testing, timing, and learning going on  :biggrin:
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 16, 2018, 10:22:06 PM
Having another small issue.

I am trying to calculate the tangent of a value using the fp* commands.


fld VALUE ;// contains 0.523599
fptan
fstp RESULT ;// should be 0.57735


According to ConverterDD (http://masm32.com/board/index.php?topic=1819.0) the value in RESULT is NaN.

ConverterDD has been displaying all results correctly so far.

Am I using fptan correctly?
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 16, 2018, 10:43:20 PM
Quote from: Lonewolff on April 16, 2018, 10:22:06 PM
Am I using fptan correctly?
What does the help file say?

include \masm32\MasmBasic\MasmBasic.inc         ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  fld FP4(0.523599)
  fptan
  deb 4, "The FPU:", ST(0), ST(1)
EndOfCode


The FPU:
ST(0)           1.000000000000000000
ST(1)           0.5773506065083982818
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 16, 2018, 11:45:22 PM
Quote from: Lonewolff on April 16, 2018, 10:22:06 PM
Having another small issue.

I am trying to calculate the tangent of a value using the fp* commands.

Quote
   fld VALUE         ;// contains 0.523599
   fptan
   fstp  st            ; remove 1.0 from st(0)<<-- See Simply FPU by Raymond
   fstp RESULT      ;// should be 0.57735

According to ConverterDD (http://masm32.com/board/index.php?topic=1819.0) the value in RESULT is NaN.

ConverterDD has been displaying all results correctly so far.

Am I using fptan correctly?    <<<--- NO, it seems you want tan(VALUE)=0.5773505683919327
Hi
    Could you post this simple example (the asm file) ?
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 17, 2018, 10:05:25 AM
Quote from: jj2007 on April 16, 2018, 10:43:20 PM
What does the help file say?

Help file says I am. The result says otherwise.


@RuiLoureiro - Will do  :t
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 17, 2018, 10:12:49 AM
This works though.


fld number ;// contains 0.523599
fptan
fstp result ;// contains 1 why?
fstp result ;// should be 0.57735


Why is it that two fstp calls are required?
Title: Re: Floating point arithmetic question
Post by: raymond on April 17, 2018, 10:27:52 AM
Quote
Why is it that two fstp calls are required?

If you read the recommended FPU tutorial (more specifically the part relating to the fptan instruction at http://www.ray.masmcode.com/tutorial/fpuchap10.htm#fptan), you will find the answers to your questions, including the last one.

It may also give you a hint that explains one of your previous comments:
Quote
According to ConverterDD (http://masm32.com/board/index.php?topic=1819.0) the value in RESULT is NaN.
That may possibly be due to valid data already being in the ST(7) and/or the ST(0) register when attempting to compute the tangent. Otherwise, a value of "1" should have been returned by ConverterDD.
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 17, 2018, 02:03:21 PM
Thanks for the link to the tutorial. Clears it up well  :t
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 17, 2018, 10:41:22 PM
Quote from: Lonewolff on April 17, 2018, 10:12:49 AM
This works though.

Quote
   fld number         ;// contains 0.523599
   fptan
   fstp resultX         ;// contains 1 why?  <<<- because fptan gives 2 results, not 1
   fstp result         ;// should be 0.57735

Why is it that two fstp calls are required?
What do you get for the new resultX variable? Try it and print it. It should be 1.0, as Raymond said.
fstp st should be used in this case; it is faster. It removes the current st(0) when we don't need it.
:t
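
As a sketch of what the stack looks like at each step, the sequence can be annotated as below. The declarations and the program skeleton are assumed for illustration. fptan replaces st(0) with the tangent and then pushes 1.0, so the tangent ends up in st(1) and the first pop only discards the 1.0.

```asm
.686
.model flat, stdcall
option casemap:none

.data
number  REAL4 0.523599     ; angle in radians, roughly pi/6
result  REAL4 ?

.code
start:
    fld   number           ; st(0) = 0.523599
    fptan                  ; st(0) = 1.0, st(1) = tan(0.523599)
    fstp  st               ; pop the 1.0; the tangent moves up to st(0)
    fstp  result           ; store about 0.57735 and pop; the stack is empty again
    ret                    ; assumed minimal exit from the entry point
end start
```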
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 17, 2018, 10:55:42 PM
Quote from: RuiLoureiro on April 17, 2018, 10:41:22 PM
fstp st should be used, it is faster. It removes the current st(0) when we don't need it.

Btw there is also fincstp, which at first sight has the same effect. But try a simple fldpi afterwards, and you'll see the difference. Olly has a section with the FPU regs; you must scroll down a little in the upper right pane to see it.
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 18, 2018, 04:04:04 AM
Quote from: jj2007 on April 17, 2018, 10:55:42 PM
Quote from: RuiLoureiro on April 17, 2018, 10:41:22 PM
fstp st should be used, it is faster. It removes the current st(0) when we don't need it.

Btw there is also fincstp, which at first sight has the same effect. But try a simple fldpi afterwards, and you'll see the difference. Olly has a section with the FPU regs; you must scroll down a little in the upper right pane to see it.
OK Jochen. I don't know where I have Olly on this computer; I rarely use it. Lonewolff may do this to learn.
What I want to say to Lonewolff is this: generally, we use fstp st to remove st(0) when we don't need it.
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 18, 2018, 07:47:06 AM
Yeah, the fstp resultX call was just to get rid of the value. I'll swap out the command now that I know about it.

Thanks guys! Awesome information as always  8)
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 18, 2018, 08:03:43 AM
Quote from: Lonewolff on April 18, 2018, 07:47:06 AM
Yeah, the fstp resultX call was just to get rid of the value. I'll swap out the command now that I know about it.

Thanks guys! Awesome information as always  8)
Good luck  :t
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 18, 2018, 08:17:31 AM
Quote from: RuiLoureiro on April 18, 2018, 08:03:43 AM
Good luck  :t

Thanks Rui  :biggrin:

There was a measurable performance increase with the fstp st method.


fld number
fptan
fstp st
fstp result



But fincstp gave me a NaN result.


fld number
fptan
fincstp
fstp result
Title: Re: Floating point arithmetic question
Post by: Siekmanski on April 18, 2018, 08:22:52 AM
Try this one,
ffree   st(0)
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 18, 2018, 08:26:21 AM

fld number
fptan
ffree st(0)
fstp result


That one gives me a NaN result also  :(
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 18, 2018, 08:34:38 AM
Was just reading about the fincstp method.

http://qcd.phys.cmu.edu/QCDcluster/intel/vtune/reference/vc98.htm

This operation is not equivalent to popping the stack, because the tag for the previous top-of-stack register is not marked empty.
Title: Re: Floating point arithmetic question
Post by: raymond on April 18, 2018, 10:18:38 AM
Quote from: Lonewolff on April 18, 2018, 08:26:21 AM

fld number
fptan
ffree st(0)
fstp result


That one gives me a NaN result also  :(

Let's see if you can use your skills to find out in the tutorial why you are getting that result. ;)
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 18, 2018, 10:31:29 AM
Quote from: raymond on April 18, 2018, 10:18:38 AM
Quote from: Lonewolff on April 18, 2018, 08:26:21 AM

fld number
fptan
ffree st(0)
fstp result


That one gives me a NaN result also  :(

Let's see if you can use your skills to find out in the tutorial why you are getting that result. ;)

Ha! Got me again.

It's that double pop thing again. Had only just woken up when I tried it out.  :t
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 18, 2018, 11:21:32 AM
Quote from: Siekmanski on April 18, 2018, 08:22:52 AM
Try this one,
ffree   st(0)

That instruction is about as rare as fincstp, and for the same reason :P

Quote from: Lonewolff on April 18, 2018, 08:34:38 AM
This operation is not equivalent to popping the stack, because the tag for the previous top-of-stack register is not marked empty.

Exactly :t

The point here is: fincstp rotates ST(0) into ST(7). If ST(7) is not empty, i.e. it carries a value, then any attempt to load ST(0) will fail.

In practice, you will never see ffree ST(0) or fincstp, but you will often see
fstp st ; pop ST, clear ST(7)
ffree ST(7) ; clear ST(7)
fxch ; to get ST(1), exchange ST(0), ST(1)
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 18, 2018, 11:24:55 AM
Quote from: jj2007 on April 18, 2018, 11:21:32 AM
Quote from: Siekmanski on April 18, 2018, 08:22:52 AM
Try this one,
ffree   st(0)

That instruction is about as rare as fincstp, and for the same reason :P

I found that using that instruction, I had to fincstp anyway to get the right result. Which negated any performance gains.

Seems that fstp st has given the best performance so far.
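
The two-instruction combination described above presumably looked like the sketch below (the declarations and the program skeleton are assumed for illustration). ffree st(0) tags the register as empty and fincstp moves the top-of-stack pointer, so together they leave the stack in the same state as the single fstp st, which is why the pair brings no gain.

```asm
.686
.model flat, stdcall
option casemap:none

.data
number  REAL4 0.523599     ; angle in radians, roughly pi/6
result  REAL4 ?

.code
start:
    fld     number         ; st(0) = 0.523599
    fptan                  ; st(0) = 1.0, st(1) = tangent
    ffree   st(0)          ; tag the register holding 1.0 as empty; TOP is unchanged
    fincstp                ; advance TOP; the tangent is now st(0)
    fstp    result         ; store the tangent and pop; the stack is empty again
    ret                    ; assumed minimal exit from the entry point
end start
```

With only one of the two instructions the stack is left inconsistent: ffree alone leaves an empty register at st(0), and fincstp alone leaves a register that still counts as occupied.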
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 19, 2018, 04:42:17 AM
Hi LoneWolff,
fstp st is the same as fstp st(0).
But there are cases where we want to remove st(1).
In these cases we use fstp st(1).

In some cases, after an FPU instruction, we add this code to detect an error:

    fstsw   ax            ; store the status word in AX
    fwait
    shr     ax, 1         ; move bit 0 (invalid operation) into the carry flag
    jc      _iserror
    ; no error: carry on here
    ...
    ; exit here without error


_iserror:
    fstp    st            ; remove st(0)
    fclex                 ; clear the exception flags in the status word, so a new instruction reports only a new error
    ; exit here with an error message
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 24, 2018, 09:23:29 AM
Hey guys,

I thought I'd ask this in this topic rather than create a new one, as it is still a FP question.

Is there a more efficient way of writing this? In regard to the final 'fstp' call.

This code works fine, just wondering if it is optimal.

fld fP ; farPlane / (farPlane - nearPlane)
fld nP
fsub
fld fP
fdiv st(0),st(1)
fstp _res
mov ecx, _res
mov [eax+40], ecx ; Store result in _33
fstp _res ; Clear the last value from the FP stack


If I don't execute the final line, a value gets left on the FP stack. I could call 'finit' but that is a very slow instruction.

Just wondering if I am going about this the right way.

Thanks again  8)


[edit]
Actually, thinking about it, I probably just need to play with the order of operation.


[edit2]
Yep worked well. And one line less.


fld fP ; farPlane / (farPlane - nearPlane)
fld fP
fld nP
fsub
fdiv
fstp _res
mov ecx, _res
mov [eax+40], ecx ; Store result in _33
Title: Re: Floating point arithmetic question
Post by: Siekmanski on April 24, 2018, 10:03:12 AM
   fld fp            ; farPlane / (farPlane - nearPlane)
   fld st
   fld rp
   fsubp
   fdivp
   fstp real4 ptr [eax+40]

:biggrin:
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 24, 2018, 10:05:52 AM
Nice!

Even better  :eusa_clap:

[edit]
Hang on. It says invalid operands on fsubp and fdivp.  :P

Were those two lines a typo?
Title: Re: Floating point arithmetic question
Post by: Siekmanski on April 24, 2018, 10:13:51 AM
No, it works.
fsub with a register pop
fdiv with a register pop

edit: I wrote rp instead of np, the rest is ok.

   fld      farPlane            ; farPlane / (farPlane - nearPlane)
   fld      st(0)
   fld      nearPlane
   fsubp
   fdivp
   fstp     real4 ptr [eax+40]
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 24, 2018, 10:25:34 AM
Weird  :icon_confused:

Doesn't compile for me with the 'p' suffix.

Quote
(22) : error A2070: invalid instruction operands
(23) : error A2070: invalid instruction operands


fld fP ; farPlane / (farPlane - nearPlane)
fld fP
fld nP
fsubp   ; line 22
fdivp   ; line 23
fstp real4 ptr [eax+40] ; Store result in _33


Title: Re: Floating point arithmetic question
Post by: Siekmanski on April 24, 2018, 10:32:45 AM
Strange, it's a valid instruction.

https://www.coursehero.com/file/p42643a0/The-FSUBP-instructions-perform-the-additional-operation-of-popping-the-FPU/

https://www.coursehero.com/file/p42643a0/The-FDIVP-instructions-perform-the-additional-operation-of-popping-the-FPU/
Title: Re: Floating point arithmetic question
Post by: raymond on April 24, 2018, 11:30:52 AM
I would have coded it as follows:
fld  fp                  ; stack: fp
fld  st                  ; stack: fp, fp          (copy of st(0) pushed)
fsub np                  ; stack: fp-np, fp       (subtract np from the fp in st(0))
fdiv                     ; stack: fp/(fp-np)      (divide fp in st(1) by st(0) and pop st(0))
                         ; st(0) now contains the result of the division
fstp real4 ptr [eax+40]  ; store result and clean fpu


Check the 'Chapter 8 - Arithmetic instructions - with REAL numbers' at http://www.ray.masmcode.com/tutorial/fpuchap8.htm
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 24, 2018, 01:17:57 PM
Thanks for the 'fsub' tip.  :t

I am finding that this...

fld fP
fld fP


...is benchmarking faster than this though.


fld fP
fld st
Title: Re: Floating point arithmetic question
Post by: raymond on April 24, 2018, 01:24:30 PM
Quote from: Lonewolff on April 24, 2018, 01:17:57 PM
Thanks for the 'fsub' tip.  :t

I am finding that this...

fld fP
fld fP


...is benchmarking faster than this though.

fld fP
fld st


Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 24, 2018, 01:28:16 PM
Quote from: raymond on April 24, 2018, 01:24:30 PM
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.

Yeah I am looping the routine 1000000000 times and timing how long to complete (the whole code snippet that is).

fld ST version      6578 ms
fld fP version      6422 ms

No matter how many times I run it the fP version is quicker on every occasion.
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 24, 2018, 01:42:12 PM
Wow, the CPU really is a quirky beast.

I have another routine that uses the 'fchs' opcode and I can speed up (or slow down) the tests by 200 ms, just by changing when 'fchs' is called.  :dazzled:

I have run it over and over and the results consistently differ by 200 ms, just depending on the placement of 'fchs'.
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 24, 2018, 04:58:19 PM
Quote from: Lonewolff on April 24, 2018, 10:25:34 AM
Weird  :icon_confused:

Doesn't compile for me with the 'p' suffix.

   fsubp   ; line 22
   fdivp   ; line 23

You may use:
- fsub (no p, but does the same)
- fsubp st(1), st
- an assembler higher than ML 6.15 (8.0+, UAsm, AsmC, JWasm)
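
A sketch of the second option, with the operands spelled out so that older ML versions accept the popping forms. The sample values, the matrix buffer and the program skeleton are assumptions for illustration; only the instruction sequence comes from the thread.

```asm
.686
.model flat, stdcall
option casemap:none

.data
farPlane   REAL4 100.0          ; sample values, chosen only for illustration
nearPlane  REAL4 1.0
matrix     REAL4 16 dup(0.0)    ; hypothetical 4x4 matrix; byte offset 40 is _33

.code
start:
    mov     eax, offset matrix
    fld     farPlane            ; st(0) = farPlane
    fld     st(0)               ; st(0) = farPlane, st(1) = farPlane
    fld     nearPlane           ; st(0) = nearPlane, st(1) = st(2) = farPlane
    fsubp   st(1), st           ; st(0) = farPlane - nearPlane, st(1) = farPlane
    fdivp   st(1), st           ; st(0) = farPlane / (farPlane - nearPlane)
    fstp    real4 ptr [eax+40]  ; store the result and pop; the stack is empty again
    ret                         ; assumed minimal exit from the entry point
end start
```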
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 25, 2018, 12:17:35 AM
Quote from: Lonewolff on April 24, 2018, 01:28:16 PM
Quote from: raymond on April 24, 2018, 01:24:30 PM
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.

Yeah I am looping the routine 1000000000 times and timing how long to complete (the whole code snippet that is).

fld ST version      6578 ms
fld fP version       6422 ms

No matter how many times I run it the fP version is quicker on every occasion.
It seems that you are testing the code on your current CPU.
Is it faster on another CPU?
Every load goes to st(0), so it seems more logical to copy the current st(0) into a new st(0) with fld st(0) than to load the same memory value into a new st(0).
So I prefer the fld st version.
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 25, 2018, 08:23:54 AM
It's tested on a relatively recent AMD. I have an i7 here as well, so I will test on that too and compare results.

Will let you know  :t
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 25, 2018, 09:42:08 AM
Weird ::)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

170     cycles for 100 * fld Real4 mem, mem
269     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
270     cycles for 100 * fld Real8 mem, st
372     cycles for 100 * fld Real10 mem, mem
372     cycles for 100 * fld Real10 mem, st

169     cycles for 100 * fld Real4 mem, mem
267     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
372     cycles for 100 * fld Real10 mem, mem
374     cycles for 100 * fld Real10 mem, st

169     cycles for 100 * fld Real4 mem, mem
269     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st

Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz (SSE4)

166     cycles for 100 * fld Real4 mem, mem
165     cycles for 100 * fld Real4 mem, st
175     cycles for 100 * fld Real8 mem, mem
168     cycles for 100 * fld Real8 mem, st
1029    cycles for 100 * fld Real10 mem, mem
596     cycles for 100 * fld Real10 mem, st

163     cycles for 100 * fld Real4 mem, mem
163     cycles for 100 * fld Real4 mem, st
163     cycles for 100 * fld Real8 mem, mem
168     cycles for 100 * fld Real8 mem, st
1041    cycles for 100 * fld Real10 mem, mem
602     cycles for 100 * fld Real10 mem, st

170     cycles for 100 * fld Real4 mem, mem
164     cycles for 100 * fld Real4 mem, st
174     cycles for 100 * fld Real8 mem, mem
169     cycles for 100 * fld Real8 mem, st
1056    cycles for 100 * fld Real10 mem, mem
611     cycles for 100 * fld Real10 mem, st
Title: Re: Floating point arithmetic question
Post by: Siekmanski on April 25, 2018, 09:56:12 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

168     cycles for 100 * fld Real4 mem, mem
266     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
272     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

167     cycles for 100 * fld Real4 mem, mem
268     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
268     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
372     cycles for 100 * fld Real10 mem, st

167     cycles for 100 * fld Real4 mem, mem
266     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---
Title: Re: Floating point arithmetic question
Post by: HSE on April 25, 2018, 10:17:25 AM

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

12      cycles for 100 * fld Real4 mem, mem
11      cycles for 100 * fld Real4 mem, st
12      cycles for 100 * fld Real8 mem, mem
11      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
318     cycles for 100 * fld Real10 mem, st

12      cycles for 100 * fld Real4 mem, mem
10      cycles for 100 * fld Real4 mem, st
11      cycles for 100 * fld Real8 mem, mem
11      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
317     cycles for 100 * fld Real10 mem, st

13      cycles for 100 * fld Real4 mem, mem
9       cycles for 100 * fld Real4 mem, st
12      cycles for 100 * fld Real8 mem, mem
10      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
319     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---

What happened here? Are the cycles that much slower than on the other machines?
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 25, 2018, 04:45:39 PM
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Are the cycles that much slower than on the other machines?

No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
  mov ebx, 99 ; loop 100x
  align 4
  .Repeat
fld MyR4
fld MyR4
fstp st
fstp st
dec ebx
  .Until Sign?
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 25, 2018, 04:52:40 PM
So I stumbled across something hey?  :lol:

Heaps faster all round for real4's to do two fld's. :t

Maybe a caching thing in the CPU itself? Knows it already has the value there so just re-uses it perhaps?
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 25, 2018, 11:21:16 PM
Quote from: jj2007 on April 25, 2018, 04:45:39 PM
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Are the cycles that much slower than on the other machines?

No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
Quote
  mov ebx, 99   ; loop 100x    <<<< THIS IS FOR real4, mem, mem ?
  align 4
  .Repeat
   fld MyR4
   fld MyR4        <<<<<<<<<<
   fstp st
   fstp st
   dec ebx
  .Until Sign?
-----------------------------
mov  ebx, 99                         <<<<< this is for real4, mem, st ?
align 4
.Repeat
  fld MyR4
  fld st        <<<<<<<<<<<<<<<<
  fstp st
  fstp st
  dec ebx
.Until Sign?
Good work Jochen  :t
Is this what you are doing?
Now I want to say that I have never used real4 or real8, only real10.
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 25, 2018, 11:32:05 PM
Quote
Intel(R) Atom(TM) CPU N455   @ 1.66GHz (SSE4)

624     cycles for 100 * fld Real4 mem, mem
596     cycles for 100 * fld Real4 mem, st

603     cycles for 100 * fld Real8 mem, mem
595     cycles for 100 * fld Real8 mem, st

1426    cycles for 100 * fld Real10 mem, mem
1015    cycles for 100 * fld Real10 mem, st


604     cycles for 100 * fld Real4 mem, mem
595     cycles for 100 * fld Real4 mem, st

600     cycles for 100 * fld Real8 mem, mem
597     cycles for 100 * fld Real8 mem, st

1427    cycles for 100 * fld Real10 mem, mem
1002    cycles for 100 * fld Real10 mem, st


613     cycles for 100 * fld Real4 mem, mem
595     cycles for 100 * fld Real4 mem, st

615     cycles for 100 * fld Real8 mem, mem
599     cycles for 100 * fld Real8 mem, st

1416    cycles for 100 * fld Real10 mem, mem
1355    cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st



Title: Re: Floating point arithmetic question
Post by: FORTRANS on April 25, 2018, 11:34:04 PM
F:\TEMP\TEST>fld_mem_
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

293     cycles for 100 * fld Real4 mem, mem
192     cycles for 100 * fld Real4 mem, st
298     cycles for 100 * fld Real8 mem, mem
186     cycles for 100 * fld Real8 mem, st
508     cycles for 100 * fld Real10 mem, mem
423     cycles for 100 * fld Real10 mem, st

290     cycles for 100 * fld Real4 mem, mem
193     cycles for 100 * fld Real4 mem, st
292     cycles for 100 * fld Real8 mem, mem
192     cycles for 100 * fld Real8 mem, st
508     cycles for 100 * fld Real10 mem, mem
423     cycles for 100 * fld Real10 mem, st

289     cycles for 100 * fld Real4 mem, mem
198     cycles for 100 * fld Real4 mem, st
296     cycles for 100 * fld Real8 mem, mem
193     cycles for 100 * fld Real8 mem, st
512     cycles for 100 * fld Real10 mem, mem
425     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 25, 2018, 11:49:08 PM
Interesting results ::)
- an AMD that does impossible things
- recent Intel i5/i7 where mem access is faster
- mobile cpus like Celeron and Pentium M where fld st is faster.

What does that teach us? Nothing 8)
Title: Re: Floating point arithmetic question
Post by: Siekmanski on April 26, 2018, 12:34:23 AM
Making fast routines isn't easy in modern times.  :(
We need a database for optimizations per system.  :bgrin:
Title: Re: Floating point arithmetic question
Post by: hutch-- on April 26, 2018, 02:54:34 AM
> What does that teach us?

Balanced algorithms with testing spread across mixed hardware.
Title: Re: Floating point arithmetic question
Post by: daydreamer on April 26, 2018, 04:00:15 AM
Quote from: Siekmanski on April 26, 2018, 12:34:23 AM
Making fast routines isn't easy in modern times.  :(
We need a database for optimizations per system.  :bgrin:
Make similar code in MASM, like Java's Just In Time (JIT) compiler:
first run and time the first code snippet, then run the second version of the snippet, and so on. Finally, compare which one ran fastest, keep only a jump to that version of the code, and set a flag that signals that the testing is over.
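
A rough sketch of that idea, under several assumptions: the names VariantMem, VariantReg, SelectFastest and pBest are hypothetical, each variant is timed only once with rdtsc (a real test would repeat the measurement many times and serialise around rdtsc), and pBest doubles as the "testing is over" flag because it is non-zero once a choice has been made.

```asm
.686
.model flat, stdcall
option casemap:none

.data
fP      REAL4 100.0
pBest   DWORD 0            ; address of the chosen variant, 0 = not tested yet

.code
VariantMem proc            ; loads the value from memory twice
    fld   fP
    fld   fP
    fstp  st
    fstp  st
    ret
VariantMem endp

VariantReg proc            ; loads once, then copies st(0)
    fld   fP
    fld   st
    fstp  st
    fstp  st
    ret
VariantReg endp

SelectFastest proc uses esi edi
    rdtsc
    mov   esi, eax         ; start count (low 32 bits suffice for short runs)
    call  VariantMem
    rdtsc
    sub   eax, esi
    mov   edi, eax         ; edi = cycles for VariantMem
    rdtsc
    mov   esi, eax
    call  VariantReg
    rdtsc
    sub   eax, esi         ; eax = cycles for VariantReg
    mov   pBest, offset VariantMem
    cmp   edi, eax
    jbe   done             ; VariantMem was at least as fast: keep it
    mov   pBest, offset VariantReg
done:
    ret
SelectFastest endp

start:
    cmp   pBest, 0         ; already tested?
    jne   run
    call  SelectFastest
run:
    call  dword ptr [pBest] ; from now on, always call the selected variant
    ret                    ; assumed minimal exit from the entry point
end start
```

The indirect call costs a little on every invocation, so this only pays off when the variants differ by more than that overhead.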
Title: Re: Floating point arithmetic question
Post by: Siekmanski on April 26, 2018, 07:06:26 AM
Nah, I prefer Hutch's approach.
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 26, 2018, 07:10:25 AM
Quote from: Siekmanski on April 26, 2018, 07:06:26 AM
Nah, I prefer Hutch's approach.
Me too  :t
On your i7-4930K the results are, more or less, 373 cycles for real10, in both cases...
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 26, 2018, 07:26:32 AM
Here is another one
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

193     cycles for 100 * fld Real4 mem, mem
174     cycles for 100 * fld Real4 mem, st

178     cycles for 100 * fld Real8 mem, mem
162     cycles for 100 * fld Real8 mem, st

1644    cycles for 100 * fld Real10 mem, mem
838     cycles for 100 * fld Real10 mem, st
------------------------

177     cycles for 100 * fld Real4 mem, mem
158     cycles for 100 * fld Real4 mem, st

177     cycles for 100 * fld Real8 mem, mem
174     cycles for 100 * fld Real8 mem, st

1644    cycles for 100 * fld Real10 mem, mem
847     cycles for 100 * fld Real10 mem, st
------------------------

193     cycles for 100 * fld Real4 mem, mem
175     cycles for 100 * fld Real4 mem, st

179     cycles for 100 * fld Real8 mem, mem
159     cycles for 100 * fld Real8 mem, st

1649    cycles for 100 * fld Real10 mem, mem
838     cycles for 100 * fld Real10 mem, st
------------------------

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---
Title: Re: Floating point arithmetic question
Post by: FORTRANS on April 26, 2018, 11:05:12 PM
Hi Jochen,


Quote from: jj2007 on April 25, 2018, 11:49:08 PM
Interesting results ::)
- an AMD that does impossible things
- recent Intel i5/i7 where mem access is faster
- mobile cpus like Celeron and Pentium M where fld st is faster.

What does that teach us? Nothing 8)

Quote from: jj2007 on April 25, 2018, 04:45:39 PM
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Are the cycles that much slower than on the other machines?

No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
  mov ebx, 99 ; loop 100x
  align 4
  .Repeat
fld MyR4
fld MyR4
fstp st
fstp st
dec ebx
  .Until Sign?


   An idea occurred to me.*  Loading MyR4 twice may be cached
by the FPU?  Would making a 200 element array, and loading a
different value {say MyR4[X],MyR4[X+400]} change the timing
any?  Sorry if this is a silly question, the AMD results just looked
too odd.

Regards,

Steve N.

* Yes, it does happen.  Of course it doesn't imply a good one.
Title: Re: Floating point arithmetic question
Post by: HSE on April 26, 2018, 11:22:26 PM
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

11      cycles for 100 * fld Real4 mem, mem
7       cycles for 100 * fld Real4 mem, st
50      cycles for 100 * fld Real8 mem, mem
9       cycles for 100 * fld Real8 mem, st
600     cycles for 100 * fld Real10 mem, mem
309     cycles for 100 * fld Real10 mem, st

10      cycles for 100 * fld Real4 mem, mem
7       cycles for 100 * fld Real4 mem, st
50      cycles for 100 * fld Real8 mem, mem
7       cycles for 100 * fld Real8 mem, st
602     cycles for 100 * fld Real10 mem, mem
309     cycles for 100 * fld Real10 mem, st

9       cycles for 100 * fld Real4 mem, mem
7       cycles for 100 * fld Real4 mem, st
52      cycles for 100 * fld Real8 mem, mem
7       cycles for 100 * fld Real8 mem, st
600     cycles for 100 * fld Real10 mem, mem
309     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---


Using myR4 and myR4b, myR8 and myR8b, and myR10 and myR10b.

Note that R8 mem, mem is slower now, but R4 shows no difference.
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 26, 2018, 11:34:14 PM
Quote from: FORTRANS on April 26, 2018, 11:05:12 PM
An idea occurred to me.*  Loading MyR4 twice may be cached
by the FPU?  Would making a 200 element array, and loading a
different value {say MyR4[X],MyR4[X+400]} change the timing
any?  Sorry if this is a silly question, the AMD results just looked
too odd.

Steve,
That is definitely not a silly question, so I tested it with randomly filled arrays:
  mov ebx, AlgoLoops-1
  align 4
  .Repeat
    fld MyR4[4*ebx]
    fld MyR4[4*ebx]
    fstp st
    fstp st
    dec ebx
  .Until Sign?

Results with AlgoLoops=10,000:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

20248   cycles for 10000 * fld Real4 mem, mem
30267   cycles for 10000 * fld Real4 mem, st
20704   cycles for 10000 * fld Real8 mem, mem
30563   cycles for 10000 * fld Real8 mem, st
40031   cycles for 10000 * fld Real10 mem, mem
40005   cycles for 10000 * fld Real10 mem, st

20247   cycles for 10000 * fld Real4 mem, mem
30314   cycles for 10000 * fld Real4 mem, st
20708   cycles for 10000 * fld Real8 mem, mem
30512   cycles for 10000 * fld Real8 mem, st
40018   cycles for 10000 * fld Real10 mem, mem
40026   cycles for 10000 * fld Real10 mem, st

20248   cycles for 10000 * fld Real4 mem, mem
30272   cycles for 10000 * fld Real4 mem, st
20755   cycles for 10000 * fld Real8 mem, mem
30492   cycles for 10000 * fld Real8 mem, st
40013   cycles for 10000 * fld Real10 mem, mem
40015   cycles for 10000 * fld Real10 mem, st


In short: My i5 couldn't care less :P

Now the Celeron:
Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz (SSE4)
16354   cycles for 10000 * fld Real4 mem, mem
15524   cycles for 10000 * fld Real4 mem, st
18059   cycles for 10000 * fld Real8 mem, mem
17376   cycles for 10000 * fld Real8 mem, st
98249   cycles for 10000 * fld Real10 mem, mem
57020   cycles for 10000 * fld Real10 mem, st

16382   cycles for 10000 * fld Real4 mem, mem
17095   cycles for 10000 * fld Real4 mem, st
18286   cycles for 10000 * fld Real8 mem, mem
17329   cycles for 10000 * fld Real8 mem, st
99281   cycles for 10000 * fld Real10 mem, mem
56412   cycles for 10000 * fld Real10 mem, st

15761   cycles for 10000 * fld Real4 mem, mem
16297   cycles for 10000 * fld Real4 mem, st
17914   cycles for 10000 * fld Real8 mem, mem
18403   cycles for 10000 * fld Real8 mem, st
98288   cycles for 10000 * fld Real10 mem, mem
56802   cycles for 10000 * fld Real10 mem, st
Title: Re: Floating point arithmetic question
Post by: HSE on April 26, 2018, 11:59:10 PM

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

1811    cycles for 10000 * fld Real4 mem, mem
1619    cycles for 10000 * fld Real4 mem, st
3760    cycles for 10000 * fld Real8 mem, mem
??      cycles for 10000 * fld Real8 mem, st
55645   cycles for 10000 * fld Real10 mem, mem
32089   cycles for 10000 * fld Real10 mem, st

1313    cycles for 10000 * fld Real4 mem, mem
??      cycles for 10000 * fld Real4 mem, st
5816    cycles for 10000 * fld Real8 mem, mem
1112    cycles for 10000 * fld Real8 mem, st
61808   cycles for 10000 * fld Real10 mem, mem
31172   cycles for 10000 * fld Real10 mem, st

??      cycles for 10000 * fld Real4 mem, mem
??      cycles for 10000 * fld Real4 mem, st
3241    cycles for 10000 * fld Real8 mem, mem
??      cycles for 10000 * fld Real8 mem, st
61940   cycles for 10000 * fld Real10 mem, mem
27024   cycles for 10000 * fld Real10 mem, st

18      bytes for fld Real4 mem, mem
13      bytes for fld Real4 mem, st
18      bytes for fld Real8 mem, mem
13      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---


8)
now with array in .data?
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

289     cycles for 10000 * fld Real4 mem, mem
1526    cycles for 10000 * fld Real4 mem, st
1665    cycles for 10000 * fld Real8 mem, mem
2670    cycles for 10000 * fld Real8 mem, st
51864   cycles for 10000 * fld Real10 mem, mem
25661   cycles for 10000 * fld Real10 mem, st

270     cycles for 10000 * fld Real4 mem, mem
1552    cycles for 10000 * fld Real4 mem, st
726     cycles for 10000 * fld Real8 mem, mem
2681    cycles for 10000 * fld Real8 mem, st
61508   cycles for 10000 * fld Real10 mem, mem
25214   cycles for 10000 * fld Real10 mem, st

1824    cycles for 10000 * fld Real4 mem, mem
??      cycles for 10000 * fld Real4 mem, st
177     cycles for 10000 * fld Real8 mem, mem
3478    cycles for 10000 * fld Real8 mem, st
51708   cycles for 10000 * fld Real10 mem, mem
31514   cycles for 10000 * fld Real10 mem, st

18      bytes for fld Real4 mem, mem
13      bytes for fld Real4 mem, st
18      bytes for fld Real8 mem, mem
13      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---


fld RealXX mem, mem  faster than  fld RealXX mem, st ??
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 27, 2018, 12:04:37 AM
?? cycles ... ? What does it mean?
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 27, 2018, 12:16:32 AM
Quote from: RuiLoureiro on April 27, 2018, 12:04:37 AM
?? cycles ... ? What does it mean?

It means the timing procedure could not establish a valid value. There is something strange with the AMD:
289     cycles for 10000 * fld Real4 mem, mem

This is simply impossible. My template uses cpuid + rdtsc from Michael Webster's timer macros (http://masm32.com/board/index.php?topic=49.0); no idea what could be happening there:
counter_begin TimerLoops, HIGH_PRIORITY_CLASS
call TestA
counter_end
ShowCycles TestA


Could it be that the loop overhead is wrong, or that the spinup loop is too short? Attached is a special version for HSE that shows the overhead and runs, once, a much longer spinup loop (20x):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
loop overhead is approx. 10067/10000 cycles

20191   cycles for 10000 * fld Real4 mem, mem
30011   cycles for 10000 * fld Real4 mem, st
20753   cycles for 10000 * fld Real8 mem, mem
30031   cycles for 10000 * fld Real8 mem, st
39976   cycles for 10000 * fld Real10 mem, mem
40001   cycles for 10000 * fld Real10 mem, st
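
For readers unfamiliar with this kind of timing, here is a minimal sketch of a cpuid + rdtsc measurement with overhead subtraction. It is not Michael Webster's macro code: MeasureA, TestA, EmptyLoop, tscStart, cyclesRaw and cyclesBase are hypothetical names, and it assumes that the reported figure is the raw count minus a separately measured empty-loop count. A real harness would also repeat the measurement many times at high priority.

```masm
; Sketch only, 32-bit MASM (.586 or later for cpuid/rdtsc).
; Assumed to exist elsewhere (hypothetical names):
;   TestA     - the loop under test
;   EmptyLoop - the same loop without the fld/fstp body
.data?
tscStart   dd ?
cyclesRaw  dd ?
cyclesBase dd ?

.code
MeasureA proc uses ebx          ; cpuid overwrites ebx, so preserve it
    xor eax, eax
    cpuid                       ; serialise before reading the counter
    rdtsc                       ; edx:eax = time stamp counter
    mov tscStart, eax           ; low dword is enough for short runs
    call TestA
    xor eax, eax
    cpuid                       ; serialise again so TestA has completed
    rdtsc
    sub eax, tscStart
    mov cyclesRaw, eax          ; loop under test, overhead included

    xor eax, eax
    cpuid
    rdtsc
    mov tscStart, eax
    call EmptyLoop
    xor eax, eax
    cpuid
    rdtsc
    sub eax, tscStart
    mov cyclesBase, eax         ; overhead estimate

    mov eax, cyclesRaw
    sub eax, cyclesBase         ; value to report = raw - overhead
    ret
MeasureA endp
```

Because the result is a difference of two measurements, a CPU that overlaps the loop bookkeeping with the loads could make the two counts nearly equal, which might explain results that look impossibly low or cannot be established at all.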
Title: Re: Floating point arithmetic question
Post by: FORTRANS on April 27, 2018, 12:44:51 AM
Hi,

   Thanks for testing.

Quote from: jj2007 on April 26, 2018, 11:34:14 PM

Steve,
That is definitely not a silly question, so I tested it with randomly filled arrays:

    fld MyR4[4*ebx]
    fld MyR4[4*ebx]


   But again you load the same value twice in a row.  That's why I
put in a +400 for the second load.  But HSE's results show that it
probably would not change things on his AMD.

Thanks,

Steve
Title: Re: Floating point arithmetic question
Post by: jimg on April 27, 2018, 01:37:21 AM
AMD Phenom(tm) II X6 1045T Processor (SSE3)
Spinup done
loop overhead is approx. 20057/10000 cycles

879     cycles for 10000 * fld Real4 mem, mem
6518    cycles for 10000 * fld Real4 mem, st
3628    cycles for 10000 * fld Real8 mem, mem
4023    cycles for 10000 * fld Real8 mem, st
60146   cycles for 10000 * fld Real10 mem, mem
47240   cycles for 10000 * fld Real10 mem, st

841     cycles for 10000 * fld Real4 mem, mem
1091    cycles for 10000 * fld Real4 mem, st
3911    cycles for 10000 * fld Real8 mem, mem
5221    cycles for 10000 * fld Real8 mem, st
60101   cycles for 10000 * fld Real10 mem, mem
43169   cycles for 10000 * fld Real10 mem, st

695     cycles for 10000 * fld Real4 mem, mem
870     cycles for 10000 * fld Real4 mem, st
4738    cycles for 10000 * fld Real8 mem, mem
4917    cycles for 10000 * fld Real8 mem, st
60153   cycles for 10000 * fld Real10 mem, mem
35096   cycles for 10000 * fld Real10 mem, st

18      bytes for fld Real4 mem, mem
13      bytes for fld Real4 mem, st
18      bytes for fld Real8 mem, mem
13      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


-
Title: Re: Floating point arithmetic question
Post by: Siekmanski on April 27, 2018, 01:43:22 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
Spinup done
loop overhead is approx. 10454/10000 cycles

20329   cycles for 10000 * fld Real4 mem, mem
29593   cycles for 10000 * fld Real4 mem, st
21202   cycles for 10000 * fld Real8 mem, mem
29582   cycles for 10000 * fld Real8 mem, st
39593   cycles for 10000 * fld Real10 mem, mem
39595   cycles for 10000 * fld Real10 mem, st

19664   cycles for 10000 * fld Real4 mem, mem
29596   cycles for 10000 * fld Real4 mem, st
19650   cycles for 10000 * fld Real8 mem, mem
29598   cycles for 10000 * fld Real8 mem, st
39594   cycles for 10000 * fld Real10 mem, mem
39597   cycles for 10000 * fld Real10 mem, st

19614   cycles for 10000 * fld Real4 mem, mem
29630   cycles for 10000 * fld Real4 mem, st
19701   cycles for 10000 * fld Real8 mem, mem
29659   cycles for 10000 * fld Real8 mem, st
39591   cycles for 10000 * fld Real10 mem, mem
39590   cycles for 10000 * fld Real10 mem, st

18      bytes for fld Real4 mem, mem
13      bytes for fld Real4 mem, st
18      bytes for fld Real8 mem, mem
13      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 27, 2018, 03:32:49 AM
Quote from: FORTRANS on April 27, 2018, 12:44:51 AM
But again you load the same value twice in a row.

OK, since you insist :P
fld MyR4[4*ebx]
fld MyR4[4*ebx+4*2000]


Note that Jim's AMD needs twice as much time for the naked loop; it could be that the AMD does loop and fpu loads simultaneously. Agner (http://www.agner.org/optimize/blog/read.php?i=838) says the latest Ryzen is faster:
Quote
This makes it possible to execute a tiny loop with up to six instructions in one clock cycle per iteration

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Spinup done, 4,200,000,000*dec eax
loop overhead is approx. 10112/10000 cycles

20271   cycles for 10000 * fld Real4 mem, mem
30139   cycles for 10000 * fld Real4 mem, st
20586   cycles for 10000 * fld Real8 mem, mem
30194   cycles for 10000 * fld Real8 mem, st
42013   cycles for 10000 * fld Real10 mem, mem
40749   cycles for 10000 * fld Real10 mem, st

20330   cycles for 10000 * fld Real4 mem, mem
30232   cycles for 10000 * fld Real4 mem, st
20648   cycles for 10000 * fld Real8 mem, mem
30324   cycles for 10000 * fld Real8 mem, st
41906   cycles for 10000 * fld Real10 mem, mem
40546   cycles for 10000 * fld Real10 mem, st

20365   cycles for 10000 * fld Real4 mem, mem
30313   cycles for 10000 * fld Real4 mem, st
20606   cycles for 10000 * fld Real8 mem, mem
30312   cycles for 10000 * fld Real8 mem, st
42380   cycles for 10000 * fld Real10 mem, mem
41474   cycles for 10000 * fld Real10 mem, st
Title: Re: Floating point arithmetic question
Post by: FORTRANS on April 27, 2018, 05:12:40 AM
Hi Jochen,

   Thank you again.  Now my results have changed.  So I am
confused once more.

F:\TEMP\TEST>FLD_MEM_.exe
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
Spinup done, 4,200,000,000*dec eax in 2695 ms
loop overhead is approx. 10082/10000 cycles

22465   cycles for 10000 * fld Real4 mem, mem
25357   cycles for 10000 * fld Real4 mem, st
27534   cycles for 10000 * fld Real8 mem, mem
28920   cycles for 10000 * fld Real8 mem, st
75225   cycles for 10000 * fld Real10 mem, mem
52567   cycles for 10000 * fld Real10 mem, st

22216   cycles for 10000 * fld Real4 mem, mem
26432   cycles for 10000 * fld Real4 mem, st
25397   cycles for 10000 * fld Real8 mem, mem
27759   cycles for 10000 * fld Real8 mem, st
70948   cycles for 10000 * fld Real10 mem, mem
51487   cycles for 10000 * fld Real10 mem, st

20858   cycles for 10000 * fld Real4 mem, mem
25416   cycles for 10000 * fld Real4 mem, st
24065   cycles for 10000 * fld Real8 mem, mem
28251   cycles for 10000 * fld Real8 mem, st
70658   cycles for 10000 * fld Real10 mem, mem
51233   cycles for 10000 * fld Real10 mem, st


--- ok ---


   At least the Real10 results did not swap places.

Regards,

Steve N.
Title: Re: Floating point arithmetic question
Post by: RuiLoureiro on April 27, 2018, 07:34:00 AM
Results for the last version
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Spinup done, 4,200,000,000*dec eax in 2041 ms
loop overhead is approx. 18160/10000 cycles

33459   cycles for 10000 * fld Real4 mem, mem
37247   cycles for 10000 * fld Real4 mem, st
49295   cycles for 10000 * fld Real8 mem, mem
45967   cycles for 10000 * fld Real8 mem, st
186538  cycles for 10000 * fld Real10 mem, mem
108538  cycles for 10000 * fld Real10 mem, st

31169   cycles for 10000 * fld Real4 mem, mem
37200   cycles for 10000 * fld Real4 mem, st
41419   cycles for 10000 * fld Real8 mem, mem
46164   cycles for 10000 * fld Real8 mem, st
184284  cycles for 10000 * fld Real10 mem, mem
108466  cycles for 10000 * fld Real10 mem, st

31042   cycles for 10000 * fld Real4 mem, mem
37448   cycles for 10000 * fld Real4 mem, st
42335   cycles for 10000 * fld Real8 mem, mem
46058   cycles for 10000 * fld Real8 mem, st
184647  cycles for 10000 * fld Real10 mem, mem
110142  cycles for 10000 * fld Real10 mem, st


--- ok ---
Title: Re: Floating point arithmetic question
Post by: HSE on April 27, 2018, 10:15:34 AM
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
Spinup done, 4,200,000,000*dec eax in 3757 ms
loop overhead is approx. 19715/10000 cycles

10047   cycles for 10000 * fld Real4 mem, mem
7442    cycles for 10000 * fld Real4 mem, st
13013   cycles for 10000 * fld Real8 mem, mem
9412    cycles for 10000 * fld Real8 mem, st
65984   cycles for 10000 * fld Real10 mem, mem
27897   cycles for 10000 * fld Real10 mem, st

7320    cycles for 10000 * fld Real4 mem, mem
11115   cycles for 10000 * fld Real4 mem, st
12497   cycles for 10000 * fld Real8 mem, mem
12401   cycles for 10000 * fld Real8 mem, st
63567   cycles for 10000 * fld Real10 mem, mem
33683   cycles for 10000 * fld Real10 mem, st

7254    cycles for 10000 * fld Real4 mem, mem
10440   cycles for 10000 * fld Real4 mem, st
12343   cycles for 10000 * fld Real8 mem, mem
7372    cycles for 10000 * fld Real8 mem, st
63265   cycles for 10000 * fld Real10 mem, mem
30161   cycles for 10000 * fld Real10 mem, st


--- ok ---
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 27, 2018, 10:39:23 AM
What are you guys using to run this?
Title: Re: Floating point arithmetic question
Post by: jj2007 on April 27, 2018, 10:46:38 AM
The exe attached to Reply #65.

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
Spinup done, 4,200,000,000*dec eax in 3757 ms
loop overhead is approx. 19715/10000 cycles


So the AMD takes a long time to get to full speed, and has a high loop overhead. Hm :(
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 27, 2018, 10:52:29 AM
Cool  :t


16575   cycles for 10000 * fld Real4 mem, mem
18297   cycles for 10000 * fld Real4 mem, st
26425   cycles for 10000 * fld Real8 mem, mem
19634   cycles for 10000 * fld Real8 mem, st
79024   cycles for 10000 * fld Real10 mem, mem
42922   cycles for 10000 * fld Real10 mem, st

16270   cycles for 10000 * fld Real4 mem, mem
16712   cycles for 10000 * fld Real4 mem, st
24982   cycles for 10000 * fld Real8 mem, mem
17107   cycles for 10000 * fld Real8 mem, st
77920   cycles for 10000 * fld Real10 mem, mem
44588   cycles for 10000 * fld Real10 mem, st

17735   cycles for 10000 * fld Real4 mem, mem
16897   cycles for 10000 * fld Real4 mem, st
23506   cycles for 10000 * fld Real8 mem, mem
23064   cycles for 10000 * fld Real8 mem, st
79142   cycles for 10000 * fld Real10 mem, mem
41752   cycles for 10000 * fld Real10 mem, st
Title: Re: Floating point arithmetic question
Post by: HSE on April 27, 2018, 10:57:37 AM
Pipeline?
Title: Re: Floating point arithmetic question
Post by: Lonewolff on April 27, 2018, 11:02:41 AM
Quote from: HSE on April 27, 2018, 10:57:37 AM
Pipeline?

Pipeline?

Oh, AMD A8-8320
Title: Re: Floating point arithmetic question
Post by: HSE on April 27, 2018, 01:39:53 PM
Sorry Lone, I was asking JJ something else.

Pipeline is the nickname of a feature of RISC processors. They process one instruction, fetch the next instruction from memory, and read the following instruction in the same cycle (or something like that  :biggrin:).