Hi guys,
I am trying to work out how to do a simple floating point subtraction but the result is incorrect.
fild valA
fild valB
fsub
fstp result
valA is a real4 of 1000.0 and valB is a real4 of 1.0
I was expecting a result of 999.0 but I get -8.34929E+07.
Any advice would be much appreciated 8)
Ah! Worked it out. I should have been using fld not fild :eusa_clap:
For simple math and sqrt and rsqrt, use SSE instead; it's easier to make code faster in a loop
movss xmm0,val1
subss xmm0,val2
movss result,xmm0
And when you need a speedup, use movaps, subps etc. instead
Awesome! Thanks for the tip 8)
Quote from: daydreamer on April 16, 2018, 08:58:39 PM
For simple math and sqrt and rsqrt, use SSE instead; it's easier to make code faster in a loop
movss xmm0,val1
subss xmm0,val2
movss result,xmm0
And if you are not absolutely sure, time it:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
87 cycles for 100 * movss
59 cycles for 100 * fpu
80 cycles for 100 * movss
59 cycles for 100 * fpu
80 cycles for 100 * movss
59 cycles for 100 * fpu
84 cycles for 100 * movss
59 cycles for 100 * fpu
80 cycles for 100 * movss
60 cycles for 100 * fpu
24 bytes for movss
18 bytes for fpu
movss:
movss xmm0, val1
subss xmm0, val2
movss result, xmm0
fpu:
fld val1
fsub val2
fstp result
Yep. Been timing routines heaps today.
Trying to make them as tight as possible as they are performance critical for my project.
Lots of testing, timing, and learning going on :biggrin:
Having another small issue.
I am trying to calculate the tangent of a value using the fp* commands.
fld VALUE ;// contains 0.523599
fptan
fstp RESULT ;// should be 0.57735
According to ConverterDD (http://masm32.com/board/index.php?topic=1819.0) the value in RESULT is NaN.
ConverterDD has been displaying all results correctly so far.
Am I using fptan correctly?
Quote from: Lonewolff on April 16, 2018, 10:22:06 PM
Am I using fptan correctly?
What does the help file say?
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
Init
fld FP4(0.523599)
fptan
deb 4, "The FPU:", ST(0), ST(1)
EndOfCode
The FPU:
ST(0) 1.000000000000000000
ST(1) 0.5773506065083982818
Quote from: Lonewolff on April 16, 2018, 10:22:06 PM
Having another small issue.
I am trying to calculate the tangent of a value using the fp* commands.
Quote
fld VALUE ;// contains 0.523599
fptan
fstp st ; remove 1.0 from st(0)<<-- See Simply FPU by Raymond
fstp RESULT ;// should be 0.57735
According to ConverterDD (http://masm32.com/board/index.php?topic=1819.0 (http://masm32.com/board/index.php?topic=1819.0)) the value in RESULT is NaN.
ConverterDD has been displaying all results correctly so far.
Am I using fptan correctly? <<<--- NO, it seems you want tan(VALUE)=0.5773505683919327
Hi
Could you post this simple example (the asm file) ?
Quote from: jj2007 on April 16, 2018, 10:43:20 PM
What does the help file say?
Help file says I am. The result says otherwise.
@RuiLoureiro - Will do :t
This works though.
fld number ;// contains 0.523599
fptan
fstp result ;// contains 1 why?
fstp result ;// should be 0.57735
Why is it that two fstp calls are required?
Quote
Why is it that two fstp calls are required?
If you follow up on reading the recommended FPU tutorial (more specifically the part relating to the fptan instruction at http://www.ray.masmcode.com/tutorial/fpuchap10.htm#fptan) you will get the answer to your questions, including the last.
It may also give you a hint to explain one of your previous comments:
Quote
According to ConverterDD (http://masm32.com/board/index.php?topic=1819.0) the value in RESULT is NaN.
That may possibly be due to valid data already being in the ST(7) and/or ST(0) registers when attempting to compute the tangent. Otherwise a value of "1" should have been returned by ConverterDD.
Thanks for the link to the tutorial. Clears it up well :t
Quote from: Lonewolff on April 17, 2018, 10:12:49 AM
This works though.
Quote
fld number ;// contains 0.523599
fptan
fstp resultX ;// contains 1 why? <<<- because fptan gives 2 results, not 1
fstp result ;// should be 0.57735
Why is it that two fstp calls are required?
What do you get for the new resultX variable? Try to see. Print it. It should be 1.0 as Raymond said.
fstp st should be used in this case, it is faster. It removes the current st(0) when we don't need it.
:t
Quote from: RuiLoureiro on April 17, 2018, 10:41:22 PM
fstp st should be used, it is faster. It removes the current st(0) when we don't need it.
Btw there is also fincstp, which at first sight has the same effect. But try a simple fldpi afterwards, and you'll see the difference. Olly has a section with the FPU regs; you must scroll down a little in the upper right pane to see it.
Quote from: jj2007 on April 17, 2018, 10:55:42 PM
Quote from: RuiLoureiro on April 17, 2018, 10:41:22 PM
fstp st should be used, it is faster. It removes the current st(0) when we don't need it.
Btw there is also fincstp, which at first sight has the same effect. But try a simple fldpi afterwards, and you'll see the difference. Olly has a section with the FPU regs; you must scroll down a little in the upper right pane to see it.
Ok Jochen. I don't know where I have Olly on this computer. I rarely use it. LoneWolff may do this to learn.
What I want to say to LoneWolff is this:
generally, we use fstp st to remove st(0) when we don't need it.
Yeah, the fstp resultX call was just to get rid of the value. I'll swap out the command now that I know about it.
Thanks guys! Awesome information as always 8)
Quote from: Lonewolff on April 18, 2018, 07:47:06 AM
Yeah, the fstp resultX call was just to get rid of the value. I'll swap out the command now that that I know about it.
Thanks guys! Awesome information as always 8)
Good luck :t
Quote from: RuiLoureiro on April 18, 2018, 08:03:43 AM
Good luck :t
Thanks Rui :biggrin:
There was a measurable performance increase with the fstp st method.
fld number
fptan
fstp st
fstp result
But fincstp gave me a NaN result.
fld number
fptan
fincstp
fstp result
Try this one,
ffree st(0)
fld number
fptan
ffree st(0)
fstp result
That one gives me a NaN result also :(
Was just reading about the fincstp method.
http://qcd.phys.cmu.edu/QCDcluster/intel/vtune/reference/vc98.htm
This operation is not equivalent to popping the stack, because the tag for the previous top-of-stack register is not marked empty.
Quote from: Lonewolff on April 18, 2018, 08:26:21 AM
fld number
fptan
ffree st(0)
fstp result
That one gives me a NaN result also :(
Let's see if you can use your skills to find out in the tutorial why you are getting that result. ;)
Quote from: raymond on April 18, 2018, 10:18:38 AM
Quote from: Lonewolff on April 18, 2018, 08:26:21 AM
fld number
fptan
ffree st(0)
fstp result
That one gives me a NaN result also :(
Let's see if you can use your skills to find out in the tutorial why you are getting that result. ;)
Ha! Got me again.
It's that double pop thing again. Had only just woken up when I tried it out. :t
Quote from: Siekmanski on April 18, 2018, 08:22:52 AM
Try this one,
ffree st(0)
That instruction is about as rare as fincstp, and for the same reason :P
Quote from: Lonewolff on April 18, 2018, 08:34:38 AM
This operation is not equivalent to popping the stack, because the tag for the previous top-of-stack register is not marked empty.
Exactly :t
The point here is: fincstp rotates ST(0) into ST(7). If ST(7) is not empty, i.e. it carries a value, then any attempt to load ST(0) will fail.
In practice, you will never see ffree ST(0) or fincstp, but you will often see
fstp st ; pop ST, clear ST(7)
ffree ST(7) ; clear ST(7)
fxch ; to get ST(1), exchange ST(0), ST(1)
Quote from: jj2007 on April 18, 2018, 11:21:32 AM
Quote from: Siekmanski on April 18, 2018, 08:22:52 AM
Try this one,
ffree st(0)
That instruction is about as rare as fincstp, and for the same reason :P
I found that using that instruction, I had to fincstp anyway to get the right result, which negated any performance gains.
Seems that fstp st has given the best performance so far.
Hi LoneWolff,
fstp st is the same as fstp st(0).
But there are cases where we want to remove st(1) ...
In these cases we use fstp st(1) ...
In some cases, after one FPU instruction, we add this code to detect an error:
fstsw ax ; store Status Word register to AX register
fwait
shr ax, 1 ; move bit 0 to carry flag
jc _iserror
; go on, no error here
...
; exit here without error
_iserror: fstp st ; remove st(0)
fclex ; clear all bits in the status word register -> new instruction, new error ?
; exit here with an error message
Hey guys,
I thought I'd ask this in this topic rather than create a new one, as it is still a FP question.
Is there a more efficient way of writing this? In regards to the final 'fstp' call.
This code works fine, just wondering if it is optimal.
fld fP ; farPlane / (farPlane - nearPlane)
fld nP
fsub
fld fP
fdiv st(0),st(1)
fstp _res
mov ecx, _res
mov [eax+40], ecx ; Store result in _33
fstp _res ; Clear the last value from the FP stack
If I don't call the final line, a value gets left on the FP stack. I could call 'finit' but that is a very slow call.
Just wondering if I am going about this the right way.
Thanks again 8)
[edit]
Actually, thinking about it, I probably just need to play with the order of operation.
[edit2]
Yep worked well. And one line less.
fld fP ; farPlane / (farPlane - nearPlane)
fld fP
fld nP
fsub
fdiv
fstp _res
mov ecx, _res
mov [eax+40], ecx ; Store result in _33
fld fp ; farPlane / (farPlane - nearPlane)
fld st
fld rp
fsubp
fdivp
fstp real4 ptr [eax+40]
:biggrin:
Nice!
Even better :eusa_clap:
[edit]
Hang on. It says invalid operands on fsubp and fdivp. :P
Were those two lines a typo?
No, it works.
fsub with a register pop
fdiv with a register pop
edit: I wrote rp instead of np, the rest is ok.
fld farPlane ; farPlane / (farPlane - nearPlane)
fld st(0)
fld nearPlane
fsubp
fdivp
fstp real4 ptr [eax+40]
Weird :icon_confused:
Doesn't compile for me with the 'p' suffix.
Quote
(22) : error A2070: invalid instruction operands
(23) : error A2070: invalid instruction operands
fld fP ; farPlane / (farPlane - nearPlane)
fld fP
fld nP
fsubp ; line 22
fdivp ; line 23
fstp real4 ptr [eax+40] ; Store result in _33
Strange, it's a valid instruction.
https://www.coursehero.com/file/p42643a0/The-FSUBP-instructions-perform-the-additional-operation-of-popping-the-FPU/
https://www.coursehero.com/file/p42643a0/The-FDIVP-instructions-perform-the-additional-operation-of-popping-the-FPU/
I would have coded it as follows:
fld fp ;fp
fld st ;fp fp ;make copy in st(0)
fsub np ;(fp-np) fp ;subtract np from fp in st(0)
fdiv ;fp/(fp-np) ;divide fp in st(1) by the content of st(0) and pop st(0)
;st(0) now contains the result of the division
fstp real4 ptr [eax+40] ;store result and clean fpu
Check the 'Chapter 8 - Arithmetic instructions - with REAL numbers' at http://www.ray.masmcode.com/tutorial/fpuchap8.htm
Thanks for the 'fsub' tip. :t
I am finding that this...
fld fP
fld fP
...is benchmarking faster than this though.
fld fP
fld st
Quote from: Lonewolff on April 24, 2018, 01:17:57 PM
Thanks for the 'fsub' tip. :t
I am finding that this...
fld fP
fld fP
...is benchmarking faster than this though.
fld fP
fld st
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.
Quote from: raymond on April 24, 2018, 01:24:30 PM
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.
Yeah I am looping the routine 1000000000 times and timing how long to complete (the whole code snippet that is).
fld ST version 6578 ms
fld fP version 6422 ms
No matter how many times I run it the fP version is quicker on every occasion.
Wow, the CPU really is a quirky beast.
I have another routine that uses the 'fchs' opcode and I can speed up (or slow down) the tests by 200 ms, just by changing when 'fchs' is called. :dazzled:
I have run it over and over and the results consistently differ by 200 ms, just depending on the placement of 'fchs'.
Quote from: Lonewolff on April 24, 2018, 10:25:34 AM
Weird :icon_confused:
Doesn't compile for me with the 'p' suffix.
fsubp ; line 22
fdivp ; line 23
You may use:
- fsub (no p, but does the same)
- fsubp st(1), st
- an assembler higher than ML 6.15 (8.0+, UAsm, AsmC, JWasm)
Quote from: Lonewolff on April 24, 2018, 01:28:16 PM
Quote from: raymond on April 24, 2018, 01:24:30 PM
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.
Yeah I am looping the routine 1000000000 times and timing how long to complete (the whole code snippet that is).
fld ST version 6578 ms
fld fP version 6422 ms
No matter how many times I run it the fP version is quicker on every occasion.
It seems that you are testing the code on your current CPU.
Is it faster on another CPU ?
Any load goes to st(0). So it seems more logical to copy the current st(0) into a new st(0) - fld st(0) - than to load the same memory value into a new st(0) again. So I prefer the fld ST version.
It's tested on a relatively recent AMD. I have an i7 here as well, so I will test on that also to compare results.
Will let you know :t
Weird ::)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
170 cycles for 100 * fld Real4 mem, mem
269 cycles for 100 * fld Real4 mem, st
169 cycles for 100 * fld Real8 mem, mem
270 cycles for 100 * fld Real8 mem, st
372 cycles for 100 * fld Real10 mem, mem
372 cycles for 100 * fld Real10 mem, st
169 cycles for 100 * fld Real4 mem, mem
267 cycles for 100 * fld Real4 mem, st
169 cycles for 100 * fld Real8 mem, mem
267 cycles for 100 * fld Real8 mem, st
372 cycles for 100 * fld Real10 mem, mem
374 cycles for 100 * fld Real10 mem, st
169 cycles for 100 * fld Real4 mem, mem
269 cycles for 100 * fld Real4 mem, st
169 cycles for 100 * fld Real8 mem, mem
267 cycles for 100 * fld Real8 mem, st
373 cycles for 100 * fld Real10 mem, mem
373 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
Intel(R) Celeron(R) CPU N2840 @ 2.16GHz (SSE4)
166 cycles for 100 * fld Real4 mem, mem
165 cycles for 100 * fld Real4 mem, st
175 cycles for 100 * fld Real8 mem, mem
168 cycles for 100 * fld Real8 mem, st
1029 cycles for 100 * fld Real10 mem, mem
596 cycles for 100 * fld Real10 mem, st
163 cycles for 100 * fld Real4 mem, mem
163 cycles for 100 * fld Real4 mem, st
163 cycles for 100 * fld Real8 mem, mem
168 cycles for 100 * fld Real8 mem, st
1041 cycles for 100 * fld Real10 mem, mem
602 cycles for 100 * fld Real10 mem, st
170 cycles for 100 * fld Real4 mem, mem
164 cycles for 100 * fld Real4 mem, st
174 cycles for 100 * fld Real8 mem, mem
169 cycles for 100 * fld Real8 mem, st
1056 cycles for 100 * fld Real10 mem, mem
611 cycles for 100 * fld Real10 mem, st
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
168 cycles for 100 * fld Real4 mem, mem
266 cycles for 100 * fld Real4 mem, st
168 cycles for 100 * fld Real8 mem, mem
272 cycles for 100 * fld Real8 mem, st
373 cycles for 100 * fld Real10 mem, mem
373 cycles for 100 * fld Real10 mem, st
167 cycles for 100 * fld Real4 mem, mem
268 cycles for 100 * fld Real4 mem, st
168 cycles for 100 * fld Real8 mem, mem
268 cycles for 100 * fld Real8 mem, st
373 cycles for 100 * fld Real10 mem, mem
372 cycles for 100 * fld Real10 mem, st
167 cycles for 100 * fld Real4 mem, mem
266 cycles for 100 * fld Real4 mem, st
168 cycles for 100 * fld Real8 mem, mem
267 cycles for 100 * fld Real8 mem, st
373 cycles for 100 * fld Real10 mem, mem
373 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
12 cycles for 100 * fld Real4 mem, mem
11 cycles for 100 * fld Real4 mem, st
12 cycles for 100 * fld Real8 mem, mem
11 cycles for 100 * fld Real8 mem, st
612 cycles for 100 * fld Real10 mem, mem
318 cycles for 100 * fld Real10 mem, st
12 cycles for 100 * fld Real4 mem, mem
10 cycles for 100 * fld Real4 mem, st
11 cycles for 100 * fld Real8 mem, mem
11 cycles for 100 * fld Real8 mem, st
612 cycles for 100 * fld Real10 mem, mem
317 cycles for 100 * fld Real10 mem, st
13 cycles for 100 * fld Real4 mem, mem
9 cycles for 100 * fld Real4 mem, st
12 cycles for 100 * fld Real8 mem, mem
10 cycles for 100 * fld Real8 mem, st
612 cycles for 100 * fld Real10 mem, mem
319 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
What happened here? Why are the cycles so much slower than on the other machines?
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Why are the cycles so much slower than on the other machines?
No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
mov ebx, 99 ; loop 100x
align 4
.Repeat
fld MyR4
fld MyR4
fstp st
fstp st
dec ebx
.Until Sign?
So I stumbled across something hey? :lol:
Heaps faster all round for real4's to do two fld's. :t
Maybe a caching thing in the CPU itself? Knows it already has the value there so just re-uses it perhaps?
Quote from: jj2007 on April 25, 2018, 04:45:39 PM
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Why are the cycles so much slower than on the other machines?
No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
Quote
mov ebx, 99 ; loop 100x <<<< THIS IS FOR real4, mem, mem ?
align 4
.Repeat
fld MyR4
fld MyR4 <<<<<<<<<<
fstp st
fstp st
dec ebx
.Until Sign?
-----------------------------
mov ebx, 99 <<<<< this is for real4, mem, st ?
align 4
.Repeat
fld MyR4
fld st <<<<<<<<<<<<<<<<
fstp st
fstp st
dec ebx
.Until Sign?
Good work Jochen :t
Is this what you are doing, correct?
Now I want to say that I never used REAL4 or REAL8, only REAL10.
Quote
Intel(R) Atom(TM) CPU N455 @ 1.66GHz (SSE4)
624 cycles for 100 * fld Real4 mem, mem
596 cycles for 100 * fld Real4 mem, st
603 cycles for 100 * fld Real8 mem, mem
595 cycles for 100 * fld Real8 mem, st
1426 cycles for 100 * fld Real10 mem, mem
1015 cycles for 100 * fld Real10 mem, st
604 cycles for 100 * fld Real4 mem, mem
595 cycles for 100 * fld Real4 mem, st
600 cycles for 100 * fld Real8 mem, mem
597 cycles for 100 * fld Real8 mem, st
1427 cycles for 100 * fld Real10 mem, mem
1002 cycles for 100 * fld Real10 mem, st
613 cycles for 100 * fld Real4 mem, mem
595 cycles for 100 * fld Real4 mem, st
615 cycles for 100 * fld Real8 mem, mem
599 cycles for 100 * fld Real8 mem, st
1416 cycles for 100 * fld Real10 mem, mem
1355 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
-
F:\TEMP\TEST>fld_mem_
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
293 cycles for 100 * fld Real4 mem, mem
192 cycles for 100 * fld Real4 mem, st
298 cycles for 100 * fld Real8 mem, mem
186 cycles for 100 * fld Real8 mem, st
508 cycles for 100 * fld Real10 mem, mem
423 cycles for 100 * fld Real10 mem, st
290 cycles for 100 * fld Real4 mem, mem
193 cycles for 100 * fld Real4 mem, st
292 cycles for 100 * fld Real8 mem, mem
192 cycles for 100 * fld Real8 mem, st
508 cycles for 100 * fld Real10 mem, mem
423 cycles for 100 * fld Real10 mem, st
289 cycles for 100 * fld Real4 mem, mem
198 cycles for 100 * fld Real4 mem, st
296 cycles for 100 * fld Real8 mem, mem
193 cycles for 100 * fld Real8 mem, st
512 cycles for 100 * fld Real10 mem, mem
425 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
Interesting results ::)
- an AMD that does impossible things
- recent Intel i5/i7 where mem access is faster
- mobile cpus like Celeron and Pentium M where fld st is faster.
What does that teach us? Nothing 8)
Making fast routines isn't easy in modern times. :(
We need a database for optimizations per system. :bgrin:
> What does that teach us?
Balanced algorithms with testing spread across mixed hardware.
Quote from: Siekmanski on April 26, 2018, 12:34:23 AM
Making fast routines isn't easy in modern times. :(
We need a database for optimizations per system. :bgrin:
Make similar code in MASM, like Java's Just-In-Time (JIT) compiler:
first run and time the first code snippet, then run the second version of the snippet, and so on; finally compare which runs fastest, keep the code so it only jumps to that version, and set a flag that signals that the testing for the fastest version is over.
Nah, I prefer Hutch's approach.
Quote from: Siekmanski on April 26, 2018, 07:06:26 AM
Nah, I prefer Hutch's approach.
Me too :t
On your i7-4930K the results are, more or less, 373 cycles for real10 in both cases...
Here is another:
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
193 cycles for 100 * fld Real4 mem, mem
174 cycles for 100 * fld Real4 mem, st
178 cycles for 100 * fld Real8 mem, mem
162 cycles for 100 * fld Real8 mem, st
1644 cycles for 100 * fld Real10 mem, mem
838 cycles for 100 * fld Real10 mem, st
------------------------
177 cycles for 100 * fld Real4 mem, mem
158 cycles for 100 * fld Real4 mem, st
177 cycles for 100 * fld Real8 mem, mem
174 cycles for 100 * fld Real8 mem, st
1644 cycles for 100 * fld Real10 mem, mem
847 cycles for 100 * fld Real10 mem, st
------------------------
193 cycles for 100 * fld Real4 mem, mem
175 cycles for 100 * fld Real4 mem, st
179 cycles for 100 * fld Real8 mem, mem
159 cycles for 100 * fld Real8 mem, st
1649 cycles for 100 * fld Real10 mem, mem
838 cycles for 100 * fld Real10 mem, st
------------------------
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
Hi Jochen,
Quote from: jj2007 on April 25, 2018, 11:49:08 PM
Interesting results ::)
- an AMD that does impossible things
- recent Intel i5/i7 where mem access is faster
- mobile cpus like Celeron and Pentium M where fld st is faster.
What does that teach us? Nothing 8)
Quote from: jj2007 on April 25, 2018, 04:45:39 PM
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Why are the cycles so much slower than on the other machines?
No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
mov ebx, 99 ; loop 100x
align 4
.Repeat
fld MyR4
fld MyR4
fstp st
fstp st
dec ebx
.Until Sign?
An idea occurred to me.* Loading MyR4 twice may be cached
by the FPU? Would making a 200 element array, and loading a
different value {say MyR4[X],MyR4[X+400]} change the timing
any? Sorry if this is a silly question, the AMD results just looked
too odd.
Regards,
Steve N.
* Yes, it does happen. Of course that doesn't imply it is a good one.
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
11 cycles for 100 * fld Real4 mem, mem
7 cycles for 100 * fld Real4 mem, st
50 cycles for 100 * fld Real8 mem, mem
9 cycles for 100 * fld Real8 mem, st
600 cycles for 100 * fld Real10 mem, mem
309 cycles for 100 * fld Real10 mem, st
10 cycles for 100 * fld Real4 mem, mem
7 cycles for 100 * fld Real4 mem, st
50 cycles for 100 * fld Real8 mem, mem
7 cycles for 100 * fld Real8 mem, st
602 cycles for 100 * fld Real10 mem, mem
309 cycles for 100 * fld Real10 mem, st
9 cycles for 100 * fld Real4 mem, mem
7 cycles for 100 * fld Real4 mem, st
52 cycles for 100 * fld Real8 mem, mem
7 cycles for 100 * fld Real8 mem, st
600 cycles for 100 * fld Real10 mem, mem
309 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
Using myR4 and myR4b, myR8 and myR8b, and myR10 and myR10b.
Note that R8 is slower now, but no different with R4.
Quote from: FORTRANS on April 26, 2018, 11:05:12 PM
An idea occurred to me.* Loading MyR4 twice may be cached
by the FPU? Would making a 200 element array, and loading a
different value {say MyR4[X],MyR4[X+400]} change the timing
any? Sorry if this is a silly question, the AMD results just looked
too odd.
Steve,
That is definitely not a silly question, so I tested it with randomly filled arrays:
mov ebx, AlgoLoops-1
align 4
.Repeat
fld MyR4[4*ebx]
fld MyR4[4*ebx]
fstp st
fstp st
dec ebx
.Until Sign?
Results with AlgoLoops=10,000:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
20248 cycles for 10000 * fld Real4 mem, mem
30267 cycles for 10000 * fld Real4 mem, st
20704 cycles for 10000 * fld Real8 mem, mem
30563 cycles for 10000 * fld Real8 mem, st
40031 cycles for 10000 * fld Real10 mem, mem
40005 cycles for 10000 * fld Real10 mem, st
20247 cycles for 10000 * fld Real4 mem, mem
30314 cycles for 10000 * fld Real4 mem, st
20708 cycles for 10000 * fld Real8 mem, mem
30512 cycles for 10000 * fld Real8 mem, st
40018 cycles for 10000 * fld Real10 mem, mem
40026 cycles for 10000 * fld Real10 mem, st
20248 cycles for 10000 * fld Real4 mem, mem
30272 cycles for 10000 * fld Real4 mem, st
20755 cycles for 10000 * fld Real8 mem, mem
30492 cycles for 10000 * fld Real8 mem, st
40013 cycles for 10000 * fld Real10 mem, mem
40015 cycles for 10000 * fld Real10 mem, st
In short: My i5 couldn't care less :P
Now the Celeron:
Intel(R) Celeron(R) CPU N2840 @ 2.16GHz (SSE4)
16354 cycles for 10000 * fld Real4 mem, mem
15524 cycles for 10000 * fld Real4 mem, st
18059 cycles for 10000 * fld Real8 mem, mem
17376 cycles for 10000 * fld Real8 mem, st
98249 cycles for 10000 * fld Real10 mem, mem
57020 cycles for 10000 * fld Real10 mem, st
16382 cycles for 10000 * fld Real4 mem, mem
17095 cycles for 10000 * fld Real4 mem, st
18286 cycles for 10000 * fld Real8 mem, mem
17329 cycles for 10000 * fld Real8 mem, st
99281 cycles for 10000 * fld Real10 mem, mem
56412 cycles for 10000 * fld Real10 mem, st
15761 cycles for 10000 * fld Real4 mem, mem
16297 cycles for 10000 * fld Real4 mem, st
17914 cycles for 10000 * fld Real8 mem, mem
18403 cycles for 10000 * fld Real8 mem, st
98288 cycles for 10000 * fld Real10 mem, mem
56802 cycles for 10000 * fld Real10 mem, st
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
1811 cycles for 10000 * fld Real4 mem, mem
1619 cycles for 10000 * fld Real4 mem, st
3760 cycles for 10000 * fld Real8 mem, mem
?? cycles for 10000 * fld Real8 mem, st
55645 cycles for 10000 * fld Real10 mem, mem
32089 cycles for 10000 * fld Real10 mem, st
1313 cycles for 10000 * fld Real4 mem, mem
?? cycles for 10000 * fld Real4 mem, st
5816 cycles for 10000 * fld Real8 mem, mem
1112 cycles for 10000 * fld Real8 mem, st
61808 cycles for 10000 * fld Real10 mem, mem
31172 cycles for 10000 * fld Real10 mem, st
?? cycles for 10000 * fld Real4 mem, mem
?? cycles for 10000 * fld Real4 mem, st
3241 cycles for 10000 * fld Real8 mem, mem
?? cycles for 10000 * fld Real8 mem, st
61940 cycles for 10000 * fld Real10 mem, mem
27024 cycles for 10000 * fld Real10 mem, st
18 bytes for fld Real4 mem, mem
13 bytes for fld Real4 mem, st
18 bytes for fld Real8 mem, mem
13 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
8)
now with array in .data?
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
289 cycles for 10000 * fld Real4 mem, mem
1526 cycles for 10000 * fld Real4 mem, st
1665 cycles for 10000 * fld Real8 mem, mem
2670 cycles for 10000 * fld Real8 mem, st
51864 cycles for 10000 * fld Real10 mem, mem
25661 cycles for 10000 * fld Real10 mem, st
270 cycles for 10000 * fld Real4 mem, mem
1552 cycles for 10000 * fld Real4 mem, st
726 cycles for 10000 * fld Real8 mem, mem
2681 cycles for 10000 * fld Real8 mem, st
61508 cycles for 10000 * fld Real10 mem, mem
25214 cycles for 10000 * fld Real10 mem, st
1824 cycles for 10000 * fld Real4 mem, mem
?? cycles for 10000 * fld Real4 mem, st
177 cycles for 10000 * fld Real8 mem, mem
3478 cycles for 10000 * fld Real8 mem, st
51708 cycles for 10000 * fld Real10 mem, mem
31514 cycles for 10000 * fld Real10 mem, st
18 bytes for fld Real4 mem, mem
13 bytes for fld Real4 mem, st
18 bytes for fld Real8 mem, mem
13 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
fld RealXX mem, mem faster than fld RealXX mem, st ??
?? cycles ... what does that mean?
Quote from: RuiLoureiro on April 27, 2018, 12:04:37 AM
?? cycles ... ? what it means ?
It means the timing procedure could not establish a valid value. There is something strange with the AMD:
289 cycles for 10000 * fld Real4 mem, mem
This is simply impossible. My template uses cpuid + rdtsc from Michael Webster's timer macros (http://masm32.com/board/index.php?topic=49.0); no idea what could happen there:
counter_begin TimerLoops, HIGH_PRIORITY_CLASS
call TestA
counter_end
ShowCycles TestA
Could it be that the loop overhead is wrong, or that the spinup loop is too short? Attached a special version for HSE that shows the overhead, and uses a much longer spinup loop (20x) once:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
loop overhead is approx. 10067/10000 cycles
20191 cycles for 10000 * fld Real4 mem, mem
30011 cycles for 10000 * fld Real4 mem, st
20753 cycles for 10000 * fld Real8 mem, mem
30031 cycles for 10000 * fld Real8 mem, st
39976 cycles for 10000 * fld Real10 mem, mem
40001 cycles for 10000 * fld Real10 mem, st
Hi,
Thanks for testing.
Quote from: jj2007 on April 26, 2018, 11:34:14 PM
Steve,
That is definitely not a silly question, so I tested it with randomly filled arrays:
fld MyR4[4*ebx]
fld MyR4[4*ebx]
But again you load the same value twice in a row. That's why I
put in a +400 for the second load. But HSE's results show that
probably would not change things on his AMD.
Thanks,
Steve
AMD Phenom(tm) II X6 1045T Processor (SSE3)
Spinup done
loop overhead is approx. 20057/10000 cycles
879 cycles for 10000 * fld Real4 mem, mem
6518 cycles for 10000 * fld Real4 mem, st
3628 cycles for 10000 * fld Real8 mem, mem
4023 cycles for 10000 * fld Real8 mem, st
60146 cycles for 10000 * fld Real10 mem, mem
47240 cycles for 10000 * fld Real10 mem, st
841 cycles for 10000 * fld Real4 mem, mem
1091 cycles for 10000 * fld Real4 mem, st
3911 cycles for 10000 * fld Real8 mem, mem
5221 cycles for 10000 * fld Real8 mem, st
60101 cycles for 10000 * fld Real10 mem, mem
43169 cycles for 10000 * fld Real10 mem, st
695 cycles for 10000 * fld Real4 mem, mem
870 cycles for 10000 * fld Real4 mem, st
4738 cycles for 10000 * fld Real8 mem, mem
4917 cycles for 10000 * fld Real8 mem, st
60153 cycles for 10000 * fld Real10 mem, mem
35096 cycles for 10000 * fld Real10 mem, st
18 bytes for fld Real4 mem, mem
13 bytes for fld Real4 mem, st
18 bytes for fld Real8 mem, mem
13 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
-
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
Spinup done
loop overhead is approx. 10454/10000 cycles
20329 cycles for 10000 * fld Real4 mem, mem
29593 cycles for 10000 * fld Real4 mem, st
21202 cycles for 10000 * fld Real8 mem, mem
29582 cycles for 10000 * fld Real8 mem, st
39593 cycles for 10000 * fld Real10 mem, mem
39595 cycles for 10000 * fld Real10 mem, st
19664 cycles for 10000 * fld Real4 mem, mem
29596 cycles for 10000 * fld Real4 mem, st
19650 cycles for 10000 * fld Real8 mem, mem
29598 cycles for 10000 * fld Real8 mem, st
39594 cycles for 10000 * fld Real10 mem, mem
39597 cycles for 10000 * fld Real10 mem, st
19614 cycles for 10000 * fld Real4 mem, mem
29630 cycles for 10000 * fld Real4 mem, st
19701 cycles for 10000 * fld Real8 mem, mem
29659 cycles for 10000 * fld Real8 mem, st
39591 cycles for 10000 * fld Real10 mem, mem
39590 cycles for 10000 * fld Real10 mem, st
18 bytes for fld Real4 mem, mem
13 bytes for fld Real4 mem, st
18 bytes for fld Real8 mem, mem
13 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
Quote from: FORTRANS on April 27, 2018, 12:44:51 AMBut again you load the same value twice in a row.
OK, since you insist :P
fld MyR4[4*ebx]
fld MyR4[4*ebx+4*2000]
Note that Jim's AMD needs twice as much time for the naked loop; it could be that the AMD performs the loop bookkeeping and the FPU loads simultaneously. Agner (http://www.agner.org/optimize/blog/read.php?i=838) says the latest Ryzen is faster:
QuoteThis makes it possible to execute a tiny loop with up to six instructions in one clock cycle per iteration
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Spinup done, 4,200,000,000*dec eax
loop overhead is approx. 10112/10000 cycles
20271 cycles for 10000 * fld Real4 mem, mem
30139 cycles for 10000 * fld Real4 mem, st
20586 cycles for 10000 * fld Real8 mem, mem
30194 cycles for 10000 * fld Real8 mem, st
42013 cycles for 10000 * fld Real10 mem, mem
40749 cycles for 10000 * fld Real10 mem, st
20330 cycles for 10000 * fld Real4 mem, mem
30232 cycles for 10000 * fld Real4 mem, st
20648 cycles for 10000 * fld Real8 mem, mem
30324 cycles for 10000 * fld Real8 mem, st
41906 cycles for 10000 * fld Real10 mem, mem
40546 cycles for 10000 * fld Real10 mem, st
20365 cycles for 10000 * fld Real4 mem, mem
30313 cycles for 10000 * fld Real4 mem, st
20606 cycles for 10000 * fld Real8 mem, mem
30312 cycles for 10000 * fld Real8 mem, st
42380 cycles for 10000 * fld Real10 mem, mem
41474 cycles for 10000 * fld Real10 mem, st
Hi Jochen,
Thank you again. Now my results have changed. So I am
confused once more.
F:\TEMP\TEST>FLD_MEM_.exe
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
Spinup done, 4,200,000,000*dec eax in 2695 ms
loop overhead is approx. 10082/10000 cycles
22465 cycles for 10000 * fld Real4 mem, mem
25357 cycles for 10000 * fld Real4 mem, st
27534 cycles for 10000 * fld Real8 mem, mem
28920 cycles for 10000 * fld Real8 mem, st
75225 cycles for 10000 * fld Real10 mem, mem
52567 cycles for 10000 * fld Real10 mem, st
22216 cycles for 10000 * fld Real4 mem, mem
26432 cycles for 10000 * fld Real4 mem, st
25397 cycles for 10000 * fld Real8 mem, mem
27759 cycles for 10000 * fld Real8 mem, st
70948 cycles for 10000 * fld Real10 mem, mem
51487 cycles for 10000 * fld Real10 mem, st
20858 cycles for 10000 * fld Real4 mem, mem
25416 cycles for 10000 * fld Real4 mem, st
24065 cycles for 10000 * fld Real8 mem, mem
28251 cycles for 10000 * fld Real8 mem, st
70658 cycles for 10000 * fld Real10 mem, mem
51233 cycles for 10000 * fld Real10 mem, st
--- ok ---
At least the Real10 results did not swap places.
Regards,
Steve N.
Results for the last version
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Spinup done, 4,200,000,000*dec eax in 2041 ms
loop overhead is approx. 18160/10000 cycles
33459 cycles for 10000 * fld Real4 mem, mem
37247 cycles for 10000 * fld Real4 mem, st
49295 cycles for 10000 * fld Real8 mem, mem
45967 cycles for 10000 * fld Real8 mem, st
186538 cycles for 10000 * fld Real10 mem, mem
108538 cycles for 10000 * fld Real10 mem, st
31169 cycles for 10000 * fld Real4 mem, mem
37200 cycles for 10000 * fld Real4 mem, st
41419 cycles for 10000 * fld Real8 mem, mem
46164 cycles for 10000 * fld Real8 mem, st
184284 cycles for 10000 * fld Real10 mem, mem
108466 cycles for 10000 * fld Real10 mem, st
31042 cycles for 10000 * fld Real4 mem, mem
37448 cycles for 10000 * fld Real4 mem, st
42335 cycles for 10000 * fld Real8 mem, mem
46058 cycles for 10000 * fld Real8 mem, st
184647 cycles for 10000 * fld Real10 mem, mem
110142 cycles for 10000 * fld Real10 mem, st
--- ok ---
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
Spinup done, 4,200,000,000*dec eax in 3757 ms
loop overhead is approx. 19715/10000 cycles
10047 cycles for 10000 * fld Real4 mem, mem
7442 cycles for 10000 * fld Real4 mem, st
13013 cycles for 10000 * fld Real8 mem, mem
9412 cycles for 10000 * fld Real8 mem, st
65984 cycles for 10000 * fld Real10 mem, mem
27897 cycles for 10000 * fld Real10 mem, st
7320 cycles for 10000 * fld Real4 mem, mem
11115 cycles for 10000 * fld Real4 mem, st
12497 cycles for 10000 * fld Real8 mem, mem
12401 cycles for 10000 * fld Real8 mem, st
63567 cycles for 10000 * fld Real10 mem, mem
33683 cycles for 10000 * fld Real10 mem, st
7254 cycles for 10000 * fld Real4 mem, mem
10440 cycles for 10000 * fld Real4 mem, st
12343 cycles for 10000 * fld Real8 mem, mem
7372 cycles for 10000 * fld Real8 mem, st
63265 cycles for 10000 * fld Real10 mem, mem
30161 cycles for 10000 * fld Real10 mem, st
--- ok ---
What are you guys using to run this?
The exe attached to Reply #65.
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
Spinup done, 4,200,000,000*dec eax in 3757 ms
loop overhead is approx. 19715/10000 cycles
So the AMD takes a long time to get to full speed, and has a high loop overhead. Hm :(
Cool :t
16575 cycles for 10000 * fld Real4 mem, mem
18297 cycles for 10000 * fld Real4 mem, st
26425 cycles for 10000 * fld Real8 mem, mem
19634 cycles for 10000 * fld Real8 mem, st
79024 cycles for 10000 * fld Real10 mem, mem
42922 cycles for 10000 * fld Real10 mem, st
16270 cycles for 10000 * fld Real4 mem, mem
16712 cycles for 10000 * fld Real4 mem, st
24982 cycles for 10000 * fld Real8 mem, mem
17107 cycles for 10000 * fld Real8 mem, st
77920 cycles for 10000 * fld Real10 mem, mem
44588 cycles for 10000 * fld Real10 mem, st
17735 cycles for 10000 * fld Real4 mem, mem
16897 cycles for 10000 * fld Real4 mem, st
23506 cycles for 10000 * fld Real8 mem, mem
23064 cycles for 10000 * fld Real8 mem, st
79142 cycles for 10000 * fld Real10 mem, mem
41752 cycles for 10000 * fld Real10 mem, st
Pipeline?
Sorry Lone, I was asking JJ about something else.
Pipelining is a feature popularized by RISC processors: while one instruction executes, the CPU fetches the next one from memory and decodes the one after that, all in the same cycle (or something like that :biggrin:).