Floating point arithmetic question

Siekmanski · April 24, 2018, 10:13:51 AM

No, it works.
fsub with a register pop
fdiv with a register pop

edit: I wrote rp instead of np, the rest is ok.

   fld      farPlane            ; farPlane / (farPlane - nearPlane)
   fld      st(0)
   fld      nearPlane
   fsubp
   fdivp
   fstp     real4 ptr [eax+40]

Lonewolff · April 24, 2018, 10:25:34 AM

Weird.

Doesn't compile for me with the 'p' suffix.

Quote
(22) : error A2070: invalid instruction operands
(23) : error A2070: invalid instruction operands

	fld fP					; farPlane / (farPlane - nearPlane)
	fld fP
	fld nP
	fsubp   ; line 22
	fdivp   ; line 23
	fstp real4 ptr [eax+40]	; Store result in _33

Siekmanski · April 24, 2018, 10:32:45 AM

Strange, it's a valid instruction.

https://www.coursehero.com/file/p42643a0/The-FSUBP-instructions-perform-the-additional-operation-of-popping-the-FPU/

https://www.coursehero.com/file/p42643a0/The-FDIVP-instructions-perform-the-additional-operation-of-popping-the-FPU/

raymond · April 24, 2018, 11:30:52 AM

I would have coded it as follows:

fld  fp   ;fp
fld  st   ;fp   fp   ;make copy in st(0)
fsub np   ;(fp-np)  fp   ;subtract np from the fp copy in st(0)
fdiv      ;fp/(fp-np)   ;divide fp in st(1) by the content of st(0) and pop st(0)
          ;st(0) now contains the result of the division
fstp real4 ptr [eax+40]   ;store result and clean fpu

Check the 'Chapter 8 - Arithmetic instructions - with REAL numbers' at http://www.ray.masmcode.com/tutorial/fpuchap8.htm

Lonewolff · April 24, 2018, 01:17:57 PM

Thanks for the 'fsub' tip.

I am finding that this...

fld fP
fld fP

...is benchmarking faster than this, though.

fld fP
fld st

raymond · April 24, 2018, 01:24:30 PM

Quote from: Lonewolff on April 24, 2018, 01:17:57 PM
Thanks for the 'fsub' tip.

I am finding that this...

fld fP
fld fP

...is benchmarking faster than this, though.

fld fP
fld st

Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.

Lonewolff · April 24, 2018, 01:28:16 PM

Quote from: raymond on April 24, 2018, 01:24:30 PM
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.

Yeah, I am looping the routine 1,000,000,000 times and timing how long it takes to complete (the whole code snippet, that is).

fld ST version: 6578 ms
fld fP version: 6422 ms

No matter how many times I run it, the fP version is quicker on every occasion.

Lonewolff · April 24, 2018, 01:42:12 PM

Wow, the CPU really is a quirky beast.

I have another routine that uses the 'fchs' opcode and I can speed up (or slow down) the tests by 200 ms, just by changing when 'fchs' is called.

I have run it over and over and the results consistently differ by 200 ms, just depending on the placement of 'fchs'.

jj2007 · April 24, 2018, 04:58:19 PM

Quote from: Lonewolff on April 24, 2018, 10:25:34 AM
Weird.

Doesn't compile for me with the 'p' suffix.

fsubp ; line 22
fdivp ; line 23

You may use:
- fsub with no operands (no p, but does the same)
- fsubp st(1), st
- an assembler newer than ML 6.15 (ML 8.0+, UAsm, AsmC, JWasm)
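
A minimal sketch of the second option applied to the sequence from the original post. It assumes fP and nP are REAL4 variables and that eax points at the matrix, as in the earlier snippets. The stack comments list st(0) first.

```asm
	fld   fP                  ; fP
	fld   fP                  ; fP          fP
	fld   nP                  ; nP          fP     fP
	fsubp st(1), st           ; fP-nP       fP     (st(1) = st(1) - st(0), then pop)
	fdivp st(1), st           ; fP/(fP-nP)         (st(1) = st(1) / st(0), then pop)
	fstp  real4 ptr [eax+40]  ; store the result and leave the FPU stack empty
```

With the operands spelled out, the listing should no longer depend on how a given assembler version treats the bare fsubp and fdivp mnemonics.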

RuiLoureiro · April 25, 2018, 12:17:35 AM

Quote from: Lonewolff on April 24, 2018, 01:28:16 PM
Quote from: raymond on April 24, 2018, 01:24:30 PM
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.

Yeah I am looping the routine 1000000000 times and timing how long to complete (the whole code snippet that is).

fld ST version 6578 ms
fld fP version 6422 ms

No matter how many times I run it the fP version is quicker on every occasion.

It seems that you are testing the code only on your current CPU.
Is it also faster on another CPU?
Every load goes to st(0), so it seems more logical to copy the current st(0)
into a new st(0) with fld st(0) than to load the same memory value into a new st(0).
So I prefer the fld ST version.

Lonewolff · April 25, 2018, 08:23:54 AM

It's tested on a relatively recent AMD. I have an i7 here as well, so I will test on that too and compare results.

Will let you know.

jj2007 · April 25, 2018, 09:42:08 AM

Weird.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

170     cycles for 100 * fld Real4 mem, mem
269     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
270     cycles for 100 * fld Real8 mem, st
372     cycles for 100 * fld Real10 mem, mem
372     cycles for 100 * fld Real10 mem, st

169     cycles for 100 * fld Real4 mem, mem
267     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
372     cycles for 100 * fld Real10 mem, mem
374     cycles for 100 * fld Real10 mem, st

169     cycles for 100 * fld Real4 mem, mem
269     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st

Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz (SSE4)

166     cycles for 100 * fld Real4 mem, mem
165     cycles for 100 * fld Real4 mem, st
175     cycles for 100 * fld Real8 mem, mem
168     cycles for 100 * fld Real8 mem, st
1029    cycles for 100 * fld Real10 mem, mem
596     cycles for 100 * fld Real10 mem, st

163     cycles for 100 * fld Real4 mem, mem
163     cycles for 100 * fld Real4 mem, st
163     cycles for 100 * fld Real8 mem, mem
168     cycles for 100 * fld Real8 mem, st
1041    cycles for 100 * fld Real10 mem, mem
602     cycles for 100 * fld Real10 mem, st

170     cycles for 100 * fld Real4 mem, mem
164     cycles for 100 * fld Real4 mem, st
174     cycles for 100 * fld Real8 mem, mem
169     cycles for 100 * fld Real8 mem, st
1056    cycles for 100 * fld Real10 mem, mem
611     cycles for 100 * fld Real10 mem, st

Siekmanski · April 25, 2018, 09:56:12 AM

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

168     cycles for 100 * fld Real4 mem, mem
266     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
272     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

167     cycles for 100 * fld Real4 mem, mem
268     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
268     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
372     cycles for 100 * fld Real10 mem, st

167     cycles for 100 * fld Real4 mem, mem
266     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---

HSE · April 25, 2018, 10:17:25 AM

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

12      cycles for 100 * fld Real4 mem, mem
11      cycles for 100 * fld Real4 mem, st
12      cycles for 100 * fld Real8 mem, mem
11      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
318     cycles for 100 * fld Real10 mem, st

12      cycles for 100 * fld Real4 mem, mem
10      cycles for 100 * fld Real4 mem, st
11      cycles for 100 * fld Real8 mem, mem
11      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
317     cycles for 100 * fld Real10 mem, st

13      cycles for 100 * fld Real4 mem, mem
9       cycles for 100 * fld Real4 mem, st
12      cycles for 100 * fld Real8 mem, mem
10      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
319     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---

What happened here? Are cycles so much slower than on the other machines?

jj2007 · April 25, 2018, 04:45:39 PM

Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Are cycles so much slower than on the other machines?

No idea how this can run in 11 cycles (not including the loop overhead, though).

  mov ebx, 99	; loop 100x
  align 4
  .Repeat
	fld MyR4
	fld MyR4
	fstp st
	fstp st
	dec ebx
  .Until Sign?

The MASM Forum
