Floating point arithmetic question

Started by Lonewolff, April 16, 2018, 08:20:24 PM

Siekmanski

No, it works.
fsub with a register pop
fdiv with a register pop

edit: I wrote rp instead of np, the rest is ok.

   fld      farPlane            ; farPlane / (farPlane - nearPlane)
   fld      st(0)
   fld      nearPlane
   fsubp
   fdivp
   fstp     real4 ptr [eax+40]
Creative coders use backward thinking techniques as a strategy.

Lonewolff

Weird  :icon_confused:

Doesn't compile for me with the 'p' suffix.

Quote
(22) : error A2070: invalid instruction operands
(23) : error A2070: invalid instruction operands


fld fP ; farPlane / (farPlane - nearPlane)
fld fP
fld nP
fsubp   ; line 22
fdivp   ; line 23
fstp real4 ptr [eax+40] ; Store result in _33



Siekmanski

Strange, it's a valid instruction.

https://www.coursehero.com/file/p42643a0/The-FSUBP-instructions-perform-the-additional-operation-of-popping-the-FPU/

https://www.coursehero.com/file/p42643a0/The-FDIVP-instructions-perform-the-additional-operation-of-popping-the-FPU/

raymond

I would have coded it as follows:
fld  fp   ;fp
fld  st   ;fp   fp   ;make copy in st(0)
fsub np   ;(fp-np)  fp   ;subtract np from fp in st(0)
fdiv      ;fp/(fp-np)   ;divide fp in st(1) by the content of st(0) and pop st(0)
          ;st(0) now contains the result of the division
fstp real4 ptr [eax+40]   ;store result and clean fpu


Check the 'Chapter 8 - Arithmetic instructions - with REAL numbers' at http://www.ray.masmcode.com/tutorial/fpuchap8.htm
Whenever you assume something, you risk being wrong half the time.
http://www.ray.masmcode.com

Lonewolff

Thanks for the 'fsub' tip.  :t

I am finding that this...

fld fP
fld fP


...is benchmarking faster than this, though.


fld fP
fld st

raymond

Quote from: Lonewolff on April 24, 2018, 01:17:57 PM
Thanks for the 'fsub' tip.  :t

I am finding that this...

fld fP
fld fP


...is benchmarking faster than this, though.

fld fP
fld st


Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.

Lonewolff

Quote from: raymond on April 24, 2018, 01:24:30 PM
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.

Yeah, I am looping the routine 1000000000 times and timing how long it takes to complete (the whole code snippet, that is).

fld ST version      6578 ms
fld fP version      6422 ms

No matter how many times I run it the fP version is quicker on every occasion.
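
A timing loop along these lines could be sketched as follows. This is only an illustration, not the code actually used for the numbers above: it assumes the MASM32 SDK is installed at its default path, the values given to fP and nP are made up, and `result` is a hypothetical variable standing in for the matrix slot at [eax+40]. GetTickCount typically has a resolution of only about 10 to 16 ms, so differences that small would not be meaningful.

```asm
include \masm32\include\masm32rt.inc    ; assumed default MASM32 SDK path

.data
fP      REAL4 1000.0                    ; made-up far plane value
nP      REAL4 0.1                       ; made-up near plane value
result  REAL4 0.0                       ; hypothetical stand-in for [eax+40]

.code
start:
    invoke GetTickCount
    mov esi, eax                        ; esi = start time in ms
    mov ebx, 1000000000                 ; loop count from the post
  @@:
    fld fP                              ; swap in "fld st" below to test the other version
    fld fP
    fsub nP                             ; st(0) = fP - nP
    fdiv                                ; st(0) = fP / (fP - nP), pops once
    fstp result                         ; store and leave the FPU stack balanced
    dec ebx
    jnz @B
    invoke GetTickCount
    sub eax, esi                        ; eax = elapsed ms
    print str$(eax)
    print " ms",13,10
    exit
end start
```

The loop overhead (dec/jnz) is included in both measurements, so only the difference between the two versions says anything about the instructions themselves.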

Lonewolff

Wow, the CPU really is a quirky beast.

I have another routine that uses the 'fchs' opcode and I can speed up (or slow down) the tests by 200 ms, just by changing when 'fchs' is called.  :dazzled:

I have run it over and over and the results consistently differ by 200 ms, just depending on the placement of 'fchs'.

jj2007

Quote from: Lonewolff on April 24, 2018, 10:25:34 AM
Weird  :icon_confused:

Doesn't compile for me with the 'p' suffix.

   fsubp   ; line 22
   fdivp   ; line 23

You may use:
- fsub with no operands (no p, but MASM assembles it as the popping form, so it does the same)
- fsubp st(1), st
- an assembler newer than ML 6.15 (8.0+, UAsm, AsmC, JWasm)
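
As a rough illustration of the second option, here is the snippet from earlier in the thread with explicit operands on the popping instructions. It assumes fP and nP are REAL4 variables and that eax points at the target matrix, as in the earlier posts; the stack comments are added for illustration only.

```asm
    fld   fP                    ; st(0)=fP
    fld   fP                    ; st(0)=fP        st(1)=fP
    fld   nP                    ; st(0)=nP        st(1)=fP   st(2)=fP
    fsubp st(1), st             ; st(0)=fP-nP     st(1)=fP
    fdivp st(1), st             ; st(0)=fP/(fP-nP)
    fstp  real4 ptr [eax+40]    ; store result, stack back to its starting depth
```

The same explicit form applies to fdivp, which was the second line flagged with error A2070.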

RuiLoureiro

Quote from: Lonewolff on April 24, 2018, 01:28:16 PM
Quote from: raymond on April 24, 2018, 01:24:30 PM
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.

Yeah, I am looping the routine 1000000000 times and timing how long it takes to complete (the whole code snippet, that is).

fld ST version      6578 ms
fld fP version       6422 ms

No matter how many times I run it the fP version is quicker on every occasion.

It seems that you are testing the code on your current CPU only. Is it faster on another CPU?
Any load goes to st(0), so it seems more logical to copy the current st(0) into a new st(0) with fld st(0) than to load the same memory value into a new st(0).
So I prefer the fld ST version.

Lonewolff

It's tested on a relatively recent AMD. I have an i7 here as well, so I will test on that too and compare the results.

Will let you know  :t

jj2007

Weird ::)

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

170     cycles for 100 * fld Real4 mem, mem
269     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
270     cycles for 100 * fld Real8 mem, st
372     cycles for 100 * fld Real10 mem, mem
372     cycles for 100 * fld Real10 mem, st

169     cycles for 100 * fld Real4 mem, mem
267     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
372     cycles for 100 * fld Real10 mem, mem
374     cycles for 100 * fld Real10 mem, st

169     cycles for 100 * fld Real4 mem, mem
269     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st

Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz (SSE4)

166     cycles for 100 * fld Real4 mem, mem
165     cycles for 100 * fld Real4 mem, st
175     cycles for 100 * fld Real8 mem, mem
168     cycles for 100 * fld Real8 mem, st
1029    cycles for 100 * fld Real10 mem, mem
596     cycles for 100 * fld Real10 mem, st

163     cycles for 100 * fld Real4 mem, mem
163     cycles for 100 * fld Real4 mem, st
163     cycles for 100 * fld Real8 mem, mem
168     cycles for 100 * fld Real8 mem, st
1041    cycles for 100 * fld Real10 mem, mem
602     cycles for 100 * fld Real10 mem, st

170     cycles for 100 * fld Real4 mem, mem
164     cycles for 100 * fld Real4 mem, st
174     cycles for 100 * fld Real8 mem, mem
169     cycles for 100 * fld Real8 mem, st
1056    cycles for 100 * fld Real10 mem, mem
611     cycles for 100 * fld Real10 mem, st

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

168     cycles for 100 * fld Real4 mem, mem
266     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
272     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

167     cycles for 100 * fld Real4 mem, mem
268     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
268     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
372     cycles for 100 * fld Real10 mem, st

167     cycles for 100 * fld Real4 mem, mem
266     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---

HSE


AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

12      cycles for 100 * fld Real4 mem, mem
11      cycles for 100 * fld Real4 mem, st
12      cycles for 100 * fld Real8 mem, mem
11      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
318     cycles for 100 * fld Real10 mem, st

12      cycles for 100 * fld Real4 mem, mem
10      cycles for 100 * fld Real4 mem, st
11      cycles for 100 * fld Real8 mem, mem
11      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
317     cycles for 100 * fld Real10 mem, st

13      cycles for 100 * fld Real4 mem, mem
9       cycles for 100 * fld Real4 mem, st
12      cycles for 100 * fld Real8 mem, mem
10      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
319     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---

What happened here? Why are the cycle counts so much lower than on the other machines?
Equations in Assembly: SmplMath

jj2007

Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Why are the cycle counts so much lower than on the other machines?

No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
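  mov ebx, 99          ; loop 100x
  align 4
  .Repeat
    fld MyR4           ; load from memory
    fld MyR4           ; load from memory again (the "mem, mem" case)
    fstp st            ; pop both values to keep the FPU stack balanced
    fstp st
    dec ebx
  .Until Sign?         ; exit once ebx goes negative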