Simple floating point macros.

Started by hutch--, August 13, 2018, 04:36:54 PM

jj2007

Quote from: hutch-- on August 17, 2018, 09:37:21 AM
"fldz" does do something useful, it loads 0.0, I imagine that is why Intel provide the instruction.

Sure, just like all the other 40 or so FPU instructions. But with fldz in an fpinit macro, you just block a register, because you don't "initialise" the FPU by setting a register to zero. Any calculation requires that you load ST(0) first, and only in rare cases with zero.
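
To make the "blocked register" point concrete, here is a minimal sketch (not from the original posts; plain Masm32 with masm32rt.inc is assumed). The x87 stack has eight registers, so once an "init" fldz occupies one, the eighth further load finds no free register and should raise a stack fault:

```asm
include \masm32\include\masm32rt.inc

.code
start:
  finit                 ; all eight FPU registers are empty
  fldz                  ; the "init" zero now occupies one of them
  REPEAT 8
    fld1                ; loads 1 to 7 succeed, load 8 overflows the stack
  ENDM
  fnstsw ax             ; read the FPU status word
  test ax, 40h          ; bit 6 = SF (stack fault)
  .if !ZERO?
    print "stack fault: the eighth load found no free register", 13, 10
  .else
    print "no stack fault", 13, 10
  .endif
  finit                 ; leave the FPU in a clean state
  exit

end start
```

Without the leading fldz, all eight loads fit and the stack fault bit stays clear.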

Quote
> (fninit btw doesn't exist, ML.exe encodes it exactly as finit).

He he, you will have to upgrade to ML64, it produces the correct opcode DBE3h.

You are right, I had overlooked the wait (same in ML32, UAsm, AsmC):

9B                   wait                           ; finit
DBE3                 finit
DBE3                 finit                          ; fninit

raymond

Quote
(fninit btw doesn't exist, ML.exe encodes it exactly as finit)

That statement is partly right and partly wrong. Let's clear up everything for others unfamiliar with the FPU.

FNINIT does exist.
FINIT and FNINIT have exactly the same encoding (DBE3h), except that FINIT is automatically preceded by the FWAIT opcode (9Bh).

Several other FPU instructions come in similar pairs (FCLEX/FNCLEX, FSAVE/FNSAVE, FSTCW/FNSTCW, FSTENV/FNSTENV, FSTSW/FNSTSW). In each pair, one form is encoded with a preceding fwait to ensure that the FPU has effectively completed its latest instruction before proceeding with the current one.
Whenever you assume something, you risk being wrong half the time.
https://masm32.com/masmcode/rayfil/index.html

hutch--

This is what Intel say about it.

Opcode      Instruction     64-Bit Mode     Compat/Leg Mode     Description
---------------------------------------------------------------------------
9B DB E3    FINIT           Valid           Valid               Initialize FPU after checking for pending
                                                                unmasked floating-point exceptions.
---------------------------------------------------------------------------
DB E3       FNINIT          Valid           Valid               Initialize FPU without checking for pending
                                                                unmasked floating-point exceptions.
---------------------------------------------------------------------------
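
The same pairing can be checked in a listing or disassembler. The fragment below is a minimal sketch, assuming plain Masm32; the variable name ctrlword is made up for illustration, and the opcode bytes in the comments are taken from the Intel instruction reference rather than from an assembler run:

```asm
include \masm32\include\masm32rt.inc

.data?
ctrlword dw ?           ; hypothetical buffer for the control word

.code
start:
  finit                 ; 9B DB E3   (wait + fninit)
  fninit                ; DB E3
  fclex                 ; 9B DB E2   (wait + fnclex)
  fnclex                ; DB E2
  fstsw ax              ; 9B DF E0   (wait + fnstsw ax)
  fnstsw ax             ; DF E0
  fstcw ctrlword        ; 9B D9 /7   (wait + fnstcw)
  fnstcw ctrlword       ; D9 /7
  exit

end start
```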

jj2007

Quote from: raymond on August 17, 2018, 10:15:43 AM
FNINIT does exist.

Indeed - I had corrected my error above. The disassembler splits finit as wait + fninit.

Re the use of fldz, here is an example of how an fsum(array, elements) macro could look; plain Masm32:
include \masm32\include\masm32rt.inc

fsum MACRO pSrc, elements
Local step, tmp$
  step=type(pSrc)
  lea eax, pSrc
  lea edx, [eax+step*elements-step]   ;; edx points to the last element
  if Type(pSrc) eq QWORD
    fild QWORD ptr [eax]
    .Repeat
      add eax, QWORD
      fild QWORD ptr [eax]            ;; fiadd has no 64-bit form: load, then add
      fadd
    .Until eax>=edx
  elseif Type(pSrc) eq DWORD
    fild DWORD ptr [eax]
    .Repeat
      add eax, DWORD
      fiadd DWORD ptr [eax]
    .Until eax>=edx
  elseif Type(pSrc) eq WORD
    fild WORD ptr [eax]
    .Repeat
      add eax, WORD
      fiadd WORD ptr [eax]
    .Until eax>=edx

  elseif Type(pSrc) eq REAL10
    fld REAL10 ptr [eax]
    .Repeat
      add eax, REAL10
      fld REAL10 ptr [eax]            ;; fadd has no 80-bit memory form: load, then add
      fadd
    .Until eax>=edx
  elseif Type(pSrc) eq REAL8
    fld REAL8 ptr [eax]
    .Repeat
      add eax, REAL8
      fadd REAL8 ptr [eax]
    .Until eax>=edx
  elseif Type(pSrc) eq REAL4
    fld REAL4 ptr [eax]
    .Repeat
      add eax, REAL4
      fadd REAL4 ptr [eax]
    .Until eax>=edx
  endif
ENDM

.data
MyWords dw 25, 18, 23, 17, 9, 2, 6
MyDwords dd 25, 18, 23, 17, 9, 2, 7
MyQwords dq 25, 18, 23, 17, 9, 2, 8
MyReal4s REAL4 25.0, 18.0, 23.0, 17.0, 9.0, 2.0, 9.0
MyReal8s REAL8 25.0, 18.0, 23.0, 17.0, 9.0, 2.0, 10.0
MyReal10s REAL10 25.0, 18.0, 23.0, 17.0, 9.0, 2.0, 11.0
Result dd ?

.code
start:

  fsum MyWords, lengthof MyWords ; source, #elements
  fistp Result
  print str$(Result), " is the sum of WORD integers", 13, 10

  fsum MyDwords, lengthof MyDwords ; source, #elements
  fistp Result
  print str$(Result), " is the sum of DWORD integers", 13, 10

  fsum MyQwords, lengthof MyQwords ; source, #elements
  fistp Result
  print str$(Result), " is the sum of QWORD integers", 13, 10

  fsum MyReal4s, lengthof MyReal4s ; source, #elements
  fistp Result
  print str$(Result), " is the sum of MyReal4s", 13, 10

  fsum MyReal8s, lengthof MyReal8s ; source, #elements
  fistp Result
  print str$(Result), " is the sum of MyReal8s", 13, 10

  fsum MyReal10s, lengthof MyReal10s ; source, #elements
  fistp Result
  inkey str$(Result), " is the sum of MyReal10s", 13, 10

  exit

end start


Output:
100 is the sum of WORD integers
101 is the sum of DWORD integers
102 is the sum of QWORD integers
103 is the sum of MyReal4s
104 is the sum of MyReal8s
105 is the sum of MyReal10s

raymond

Quote
Indeed - I had corrected my error above.

I know you did, Jochen. I saw it before posting. But I also wanted to emphasize, for members less familiar with the FPU, that a few other similar instructions exist. However, I forgot to mention that those with the "N" in their name are the ones which get coded without the preceding fwait instruction.

hutch--

 :biggrin:

    fsum MACRO args:VARARG
      fldz                      ;; start at 0
      FOR arg,<args>
        fld arg
        fadd
      ENDM
    ENDM


Use it like this. The same number is repeated only for testing; you can put whatever FP values you like there.

    mrm num,  FLT8(1000.0)
    fsum num,num,num,num,num,num,num,num,num,num
    fstp rslt
    invoke fptoa,rslt,pbuf      ; convert fpval to string
    conout pbuf,lf              ; display at console

jj2007

Your fsum does something completely different. To avoid the unnecessary fldz step (which uses one fpu reg more than needed), it can be written as follows:
fsumsimple MACRO arg0, args:VARARG
  fld arg0
  for arg, <args>
fadd arg
  endm
endm


Usage:

  fsumsimple num, num, num, num, num, num, num, num, num
  fstp result
  Inkey Str$("Result: %f", result)   ; convert result to string, display in console, wait for keypress


Full project attached, for 32- or 64-bit assembly with ML/ML64 or UAsm or AsmC.

hutch--

 :biggrin:

> (which uses one fpu reg more than needed)

What does it matter with sequential additions ?

> fld arg0

You are loading a variable into st(0)
fldz loads zero into st(0)

If you replace "fld arg0" with fldz in your macro, what is the difference apart from fldz being 2 bytes ?

Here is the revised version of your macro, which is now a better version than the one I posted.

    jjsum MACRO args:VARARG
      fldz
      for arg, <args>
       fadd arg
      endm
    endm

This is the output.

.text:0000000140001026 D9EE                       fldz
.text:0000000140001028 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:000000014000102e DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001034 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:000000014000103a DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001040 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001046 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:000000014000104c DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001052 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001058 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:000000014000105e DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001064 DD9D50FFFFFF               fstp qword ptr [rbp-0xb0]

How is your loading a variable more efficient than using FLDZ ?

jj2007

Your version:
000000014000108A   | D9 EE                            | fldz                                    |
000000014000108C   | DC 05 B2 02 00 00                | fadd qword ptr ds:[140001344]           |
0000000140001092   | DC 05 AC 02 00 00                | fadd qword ptr ds:[140001344]           |
0000000140001098   | DC 05 A6 02 00 00                | fadd qword ptr ds:[140001344]           |
000000014000109E   | DC 05 A0 02 00 00                | fadd qword ptr ds:[140001344]           |
00000001400010A4   | DD 1D A2 02 00 00                | fstp qword ptr ds:[14000134C]           |


My version:
0000000140001126   | DD 05 18 02 00 00                | fld qword ptr ds:[140001344]            |
000000014000112C   | DC 05 12 02 00 00                | fadd qword ptr ds:[140001344]           |
0000000140001132   | DC 05 0C 02 00 00                | fadd qword ptr ds:[140001344]           |
0000000140001138   | DC 05 06 02 00 00                | fadd qword ptr ds:[140001344]           |
000000014000113E   | DD 1D 08 02 00 00                | fstp qword ptr ds:[14000134C]           |

hutch--

 :biggrin:

So your total gain is 2 bytes.

.text:0000000140001026 D9EE                       fldz
.text:0000000140001028 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:000000014000102e DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001034 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:000000014000103a DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001040 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001046 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:000000014000104c DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001052 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001058 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:000000014000105e DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:0000000140001064 DD9D58FFFFFF               fstp qword ptr [rbp-0xa8]


.text:000000014000108f DD8550FFFFFF               fld qword ptr [rbp-0xb0]
.text:0000000140001095 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:000000014000109b DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:00000001400010a1 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:00000001400010a7 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:00000001400010ad DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:00000001400010b3 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:00000001400010b9 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:00000001400010bf DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:00000001400010c5 DC8550FFFFFF               fadd qword ptr [rbp-0xb0]
.text:00000001400010cb DD9D58FFFFFF               fstp qword ptr [rbp-0xa8]

I declare you the winner by 2 bytes.  :P

I managed to get a benchmark around both and yours is faster by about the difference in the number of instructions. I may even steal a variation of it.   :P

daydreamer

; Each macro leaves the cosine in st(0) and the sine in st(1), as fsincos does
fsincos0 MACRO
  fldz
  fld1
ENDM
fsincos90 MACRO
  fld1
  fldz
ENDM
fsincos180 MACRO
  fldz
  fld1
  fchs
ENDM
fsincos45 MACRO
  fld sqrtreciprocalof2
  fld st(0) ; must ask Raymond and other experienced FPU users whether this is the best way to duplicate st(0) so that the value is in both st(0) and st(1)
ENDM
fsincos30 MACRO
  fld half
  fld halfsqrt3
ENDM
fsincos60 MACRO
  fld halfsqrt3
  fld half
ENDM
; Past 90 degrees the only difference is that the cosine changes sign
fsincos120 MACRO
  fsincos60
  fchs
ENDM
fsincos150 MACRO
  fsincos30
  fchs
ENDM
; You could make more macros: past 180 degrees the sine changes sign
tan0 MACRO
  fldz
ENDM
tan45 MACRO
  fld1
ENDM
tan60 MACRO
  fld sqrt3
ENDM
tan30 MACRO
  fld reciprocalsqrt3
ENDM



First, you can initialise the most-used sqrt constants for trig (sqrt(2), sqrt(3), sqrt(5), sqrt(6)) at the beginning of the program, calculating them either with SSE SQRTPD or with the FPU.
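
As a rough sketch of such an initialisation with the FPU (plain Masm32 assumed): the variable names are the ones the macros above load, but the values given to them here are inferred from the names, and InitTrigConstants, half and three are names made up for this example.

```asm
include \masm32\include\masm32rt.inc

.data
half   REAL8 0.5
three  REAL8 3.0

.data?
sqrt3              REAL8 ?
halfsqrt3          REAL8 ?
reciprocalsqrt3    REAL8 ?
sqrtreciprocalof2  REAL8 ?

.code
InitTrigConstants proc      ; hypothetical helper, call once at startup
  fld three
  fsqrt                     ; st(0) = sqrt(3)
  fst sqrt3                 ; store it, keep it on the stack
  fmul half                 ; st(0) = sqrt(3)/2
  fstp halfsqrt3
  fld1
  fdiv sqrt3                ; st(0) = 1/sqrt(3)
  fstp reciprocalsqrt3
  fld half
  fsqrt                     ; sqrt(1/2) = 1/sqrt(2)
  fstp sqrtreciprocalof2    ; FPU stack is empty again
  ret
InitTrigConstants endp

start:
  call InitTrigConstants
  exit

end start
```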

I know more exact trig values; if anyone is interested I can make more macros (about 150 of them).

The following macros are useful for producing radian angles from pi (180 degrees), for example 2pi (360 degrees), pi/2 (90 degrees), pi/4 (45 degrees) or pi/8 (22.5 degrees),
or for preparing a float constant before looping through 256, 512 or 1024 trig calculations (fptan, fsin, fcos, fsincos) from zero to pi or from zero to 2pi.

fscalepi MACRO scale
  fld scale
  fldpi
  fscale      ;; st(0) = pi * 2^trunc(scale); scale stays in st(1)
  ;; use it when you want radians up to 360 degrees, or need a constant for x iterations through a trig calculation loop from zero to pi (180 degrees)
ENDM
fdivpi MACRO x
  fldpi
  fdiv x      ;; st(0) = pi / x
ENDM
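
A possible use of fscalepi, as a minimal sketch assuming plain Masm32; minusone and halfpi are hypothetical names. Since fscale leaves the scale factor in st(1), the example removes it before storing the result:

```asm
include \masm32\include\masm32rt.inc

fscalepi MACRO scale
  fld scale
  fldpi
  fscale
ENDM

.data
minusone REAL8 -1.0
halfpi   REAL8 ?

.code
start:
  fscalepi minusone     ; st(0) = pi * 2^-1 = pi/2, st(1) = -1.0
  fstp st(1)            ; overwrite the leftover scale and pop: st(0) = pi/2
  fstp halfpi           ; store pi/2, FPU stack is empty again
  exit

end start
```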
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

Quote from: hutch-- on August 17, 2018, 09:53:29 PM
I may even steal a variation of it

I'll be happy if you do that, Hutch, as it's simply the right way to do it :icon14:


hutch--

 :biggrin:

    fpsum MACRO arg1, args:VARARG   ;; sum a list of numbers
      fld arg1
      FOR arg, <args>               ;; stolen from JJ :)
        fadd arg
      ENDM
    ENDM

:P

daydreamer

Quote from: hutch-- on August 19, 2018, 12:12:05 AM
:biggrin:

    fpsum MACRO arg1, args:VARARG   ;; sum a list of numbers
      fld arg1
      FOR arg, <args>               ;; stolen from JJ :)
        fadd arg
      ENDM
    ENDM

:P
That's a nice macro. I wonder if it's possible to extend it to an fpmedium macro? I don't know if it's possible to end it somehow with
fdiv numberofargs ?

jj2007

Quote from: daydreamer on August 19, 2018, 12:41:26 AM
wonder if its possible to extend it to a fpmedium macro?

Sure:
fpaverage MACRO arg1, args:VARARG
Local ct
  ct=1
  fld arg1
  FOR arg, <args>
    ct=ct+1
    fadd arg
  ENDM
  push ct
  fidiv dword ptr [esp]
  add esp, DWORD
ENDM
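
A usage sketch for fpaverage, assuming plain 32-bit Masm32 (the macro pushes a DWORD and uses esp, so it is 32-bit only as written); v1, v2 and v3 are made-up test values. The result is stored as an integer so it can be printed with str$, as in the fsum example earlier in the thread:

```asm
include \masm32\include\masm32rt.inc

fpaverage MACRO arg1, args:VARARG
Local ct
  ct=1
  fld arg1
  FOR arg, <args>
    ct=ct+1
    fadd arg
  ENDM
  push ct                   ;; number of arguments as a DWORD on the stack
  fidiv dword ptr [esp]     ;; st(0) = sum / count
  add esp, DWORD
ENDM

.data
v1 REAL8 10.0
v2 REAL8 20.0
v3 REAL8 60.0
Result dd ?

.code
start:
  fpaverage v1, v2, v3      ; (10+20+60)/3
  fistp Result              ; store the average as an integer
  print str$(Result), " is the average", 13, 10
  exit

end start
```

With these three values the program should print 30.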