Author Topic: Floating point arithmetic question  (Read 2467 times)

Siekmanski

Re: Floating point arithmetic question
« Reply #30 on: April 24, 2018, 10:13:51 AM »
No, it works.
fsubp = fsub with a register pop
fdivp = fdiv with a register pop

edit: I wrote rp instead of np, the rest is ok.

Code:
   fld      farPlane            ; farPlane / (farPlane - nearPlane)
   fld      st(0)
   fld      nearPlane
   fsubp
   fdivp
   fstp     real4 ptr [eax+40]
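
To make the "register pop" concrete, here is the same sequence with the x87 stack contents written out after each step. This is only an annotated sketch: fp and np in the comments are shorthand for farPlane and nearPlane, which are assumed to be REAL4 variables, and the operand-less fsubp/fdivp are assumed to use the implicit operands st(1), st.

```asm
    fld     farPlane            ; st0 = fp
    fld     st(0)               ; st0 = fp      st1 = fp
    fld     nearPlane           ; st0 = np      st1 = fp      st2 = fp
    fsubp                       ; st1 = st1 - st0, then pop:  st0 = fp-np   st1 = fp
    fdivp                       ; st1 = st1 / st0, then pop:  st0 = fp/(fp-np)
    fstp    real4 ptr [eax+40]  ; store the result and pop; the stack depth is back where it started
```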
Creative coders use backward thinking techniques as a strategy.

Ascended

« Reply #31 on: April 24, 2018, 10:25:34 AM »
Weird.

Doesn't compile for me with the 'p' suffix.

Quote
(22) : error A2070: invalid instruction operands
(23) : error A2070: invalid instruction operands

Code:
fld fP ; farPlane / (farPlane - nearPlane)
fld fP
fld nP
fsubp   ; line 22
fdivp   ; line 23
fstp real4 ptr [eax+40] ; Store result in _33



raymond

« Reply #33 on: April 24, 2018, 11:30:52 AM »
I would have coded it as follows:
Code:
fld  fp   ;fp
fld  st   ;fp   fp   ;make copy in st(0)
fsub np   ;(fp-np)  fp   ;subtract np from fp in st(0)
fdiv      ;fp/(fp-np)   ;divide fp in st(1) by the content of st(0) and pop st(0)
          ;st(0) now contains the result of the division
fstp real4 ptr [eax+40]   ;store result and clean fpu

Check the 'Chapter 8 - Arithmetic instructions - with REAL numbers' at http://www.ray.masmcode.com/tutorial/fpuchap8.htm
Whenever you assume something, you risk being wrong half the time.
http://www.ray.masmcode.com/

Ascended

« Reply #34 on: April 24, 2018, 01:17:57 PM »
Thanks for the 'fsub' tip.

I am finding that this...

Code:
fld fP
fld fP

...is benchmarking faster than this, though.

Code:
fld fP
fld st

raymond

« Reply #35 on: April 24, 2018, 01:24:30 PM »
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.

Ascended

« Reply #36 on: April 24, 2018, 01:28:16 PM »
Yeah, I am looping the routine 1000000000 times and timing how long it takes to complete (the whole code snippet, that is).

fld ST version      6578 ms
fld fP version      6422 ms

No matter how many times I run it, the fP version is quicker every time.
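
For anyone who wants to repeat the measurement, a timing loop along these lines would do. It is a minimal sketch, not the code that produced the numbers above: it assumes the MASM32 SDK (masm32rt.inc with its print, str$ and exit macros) and uses GetTickCount for millisecond timing. The values of fP and nP and the name result are made up for illustration.

```asm
include \masm32\include\masm32rt.inc

.data
fP      REAL4 1000.0            ; hypothetical far plane value
nP      REAL4 0.1               ; hypothetical near plane value
result  REAL4 ?                 ; hypothetical destination for the quotient

.code
start:
    invoke  GetTickCount
    mov     esi, eax            ; start time in ms
    mov     ebx, 1000000000     ; iteration count from the post
@@:
    fld     fP
    fld     fP                  ; replace with "fld st" to time the other variant
    fsub    nP                  ; st0 = fP - nP
    fdiv                        ; st1 = st1 / st0, then pop
    fstp    result              ; store and pop, leaving the FPU stack clean
    dec     ebx
    jnz     @B
    invoke  GetTickCount
    sub     eax, esi            ; elapsed milliseconds
    print   str$(eax), " ms", 13, 10
    exit
end start
```

GetTickCount only has a resolution of roughly 10 to 16 ms, so differences of the size reported here need many runs before they mean much.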

Ascended

« Reply #37 on: April 24, 2018, 01:42:12 PM »
Wow, the CPU really is a quirky beast.

I have another routine that uses the 'fchs' instruction, and I can speed up (or slow down) the tests by 200 ms just by changing where 'fchs' is placed.

I have run it over and over, and the results consistently differ by 200 ms depending only on the placement of 'fchs'.

jj2007

« Reply #38 on: April 24, 2018, 04:58:19 PM »
Quote from: Ascended
Weird.

Doesn't compile for me with the 'p' suffix.

   fsubp   ; line 22
   fdivp   ; line 23

You may use:
- fsub with no operands (no p, but it does the same: subtract and pop)
- fsubp st(1), st (explicit operands)
- an assembler newer than ML 6.15 (ML 8.0+, UAsm, AsmC, JWasm)

The same applies to fdivp.
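
Applied to the snippet from Reply #31, the second option would look roughly like this. The variable names fP and nP and the [eax+40] target are taken from that post; only the two explicit operand pairs are new.

```asm
    fld     fP                  ; farPlane / (farPlane - nearPlane)
    fld     fP
    fld     nP
    fsubp   st(1), st           ; st(1) = st(1) - st(0), then pop
    fdivp   st(1), st           ; st(1) = st(1) / st(0), then pop
    fstp    real4 ptr [eax+40]  ; store result in _33
```

With the first option, the two changed lines would simply read fsub and fdiv with no operands.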

RuiLoureiro

« Reply #39 on: April 25, 2018, 12:17:35 AM »
Quote from: raymond
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.

Quote from: Ascended
Yeah I am looping the routine 1000000000 times and timing how long to complete (the whole code snippet that is).

fld ST version      6578 ms
fld fP version      6422 ms

No matter how many times I run it the fP version is quicker on every occasion.

It seems that you are testing the code only on your current CPU. Is it also faster on another CPU?
Every load goes to st(0), so it seems more logical to copy the current st(0) into a new st(0) with fld st(0) than to load the same memory value into a new st(0).
So I prefer the fld ST version.
 

Ascended

« Reply #40 on: April 25, 2018, 08:23:54 AM »
It's tested on a relatively recent AMD. I have an i7 here as well, so I will test on that too and compare results.

Will let you know.

jj2007

« Reply #41 on: April 25, 2018, 09:42:08 AM »
Weird.
Code:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

170     cycles for 100 * fld Real4 mem, mem
269     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
270     cycles for 100 * fld Real8 mem, st
372     cycles for 100 * fld Real10 mem, mem
372     cycles for 100 * fld Real10 mem, st

169     cycles for 100 * fld Real4 mem, mem
267     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
372     cycles for 100 * fld Real10 mem, mem
374     cycles for 100 * fld Real10 mem, st

169     cycles for 100 * fld Real4 mem, mem
269     cycles for 100 * fld Real4 mem, st
169     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st
Code:
Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz (SSE4)

166     cycles for 100 * fld Real4 mem, mem
165     cycles for 100 * fld Real4 mem, st
175     cycles for 100 * fld Real8 mem, mem
168     cycles for 100 * fld Real8 mem, st
1029    cycles for 100 * fld Real10 mem, mem
596     cycles for 100 * fld Real10 mem, st

163     cycles for 100 * fld Real4 mem, mem
163     cycles for 100 * fld Real4 mem, st
163     cycles for 100 * fld Real8 mem, mem
168     cycles for 100 * fld Real8 mem, st
1041    cycles for 100 * fld Real10 mem, mem
602     cycles for 100 * fld Real10 mem, st

170     cycles for 100 * fld Real4 mem, mem
164     cycles for 100 * fld Real4 mem, st
174     cycles for 100 * fld Real8 mem, mem
169     cycles for 100 * fld Real8 mem, st
1056    cycles for 100 * fld Real10 mem, mem
611     cycles for 100 * fld Real10 mem, st

Siekmanski

« Reply #42 on: April 25, 2018, 09:56:12 AM »
Code:
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

168     cycles for 100 * fld Real4 mem, mem
266     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
272     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

167     cycles for 100 * fld Real4 mem, mem
268     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
268     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
372     cycles for 100 * fld Real10 mem, st

167     cycles for 100 * fld Real4 mem, mem
266     cycles for 100 * fld Real4 mem, st
168     cycles for 100 * fld Real8 mem, mem
267     cycles for 100 * fld Real8 mem, st
373     cycles for 100 * fld Real10 mem, mem
373     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---

HSE

« Reply #43 on: April 25, 2018, 10:17:25 AM »

Code:
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

12      cycles for 100 * fld Real4 mem, mem
11      cycles for 100 * fld Real4 mem, st
12      cycles for 100 * fld Real8 mem, mem
11      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
318     cycles for 100 * fld Real10 mem, st

12      cycles for 100 * fld Real4 mem, mem
10      cycles for 100 * fld Real4 mem, st
11      cycles for 100 * fld Real8 mem, mem
11      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
317     cycles for 100 * fld Real10 mem, st

13      cycles for 100 * fld Real4 mem, mem
9       cycles for 100 * fld Real4 mem, st
12      cycles for 100 * fld Real8 mem, mem
10      cycles for 100 * fld Real8 mem, st
612     cycles for 100 * fld Real10 mem, mem
319     cycles for 100 * fld Real10 mem, st

16      bytes for fld Real4 mem, mem
12      bytes for fld Real4 mem, st
16      bytes for fld Real8 mem, mem
12      bytes for fld Real8 mem, st
16      bytes for fld Real10 mem, mem
12      bytes for fld Real10 mem, st


--- ok ---
What happened here? Why are the cycle counts so much lower than on the other machines?

jj2007

« Reply #44 on: April 25, 2018, 04:45:39 PM »
No idea how this can run in 11 cycles (not including the loop overhead, though).
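Code:
  mov ebx, 99       ; loop counter: 99 down to 0 = 100 iterations
  align 4
  .Repeat
    fld MyR4        ; the "mem, mem" variant: load the same REAL4 twice
    fld MyR4
    fstp st         ; pop both values so the FPU stack stays clean
    fstp st
    dec ebx
  .Until Sign?      ; exit once ebx goes negative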