The Floating point road

LordAdef · May 09, 2018, 03:23:54 PM

Hi, after hammering on this FPU code for hours and not getting the right result, I came to realize that printf ("result: %f", real4Value) doesn't work (or am I doing something stupid as usual?)

I decided I need to study floating point (in Asm, obviously) once and for all. So I cracked open "Modern x86 Assembly Language Programming" by D. Kusswurm (I own a copy, after Hutch's suggestion; GREAT book btw).

He was outputting his results through printf in C++, and I was still getting the same 0.0000 results, so I am now sure my printf doesn't work.
The question is, why??

I'm using MasmBasic

Code Select

Print Str$ (val3)

Question #1:
Since I can only really learn this stuff if I output results, what's a simple and straightforward way to do it?

Question #2:
I've already studied Raymond's great FPU tutorial and am now studying SSE.

When is one more suitable than the other? I saw that Marinus sometimes opted for the FPU and sometimes preferred SSE.

edit to add a Question #3:
In terms of efficiency, is it better to do any integer pre-computation in registers and only then take the values to SSE with cvtsi2ss (well, I learnt this one today, and it works :lol: ), or to do the whole thing in SSE?

tenkey · May 09, 2018, 03:50:56 PM

The C/C++ compiler converts floats (REAL4) to doubles (REAL8) when they are passed to a variadic function such as printf, so printf's %f handles only REAL8.

LordAdef · May 09, 2018, 03:54:17 PM

Quote from: tenkey on May 09, 2018, 03:50:56 PM
The C/C++ compiler converts floats (REAL4) to doubles (REAL8), so %f handles only REAL8.

Thank you for that, sir!

jj2007 · May 09, 2018, 06:00:48 PM

Quote from: LordAdef on May 09, 2018, 03:23:54 PM
I'm using MasmBasic
Code Select
Print Str$ (val3)

And it works, I guess :P

Quote
Question #2:
I've already studied Raymond's FPU great tutorial and am now studying SSE

When is one suitable over the other?

Under the hood, the CPU probably uses the same circuits. Personally, I prefer the FPU for standard tasks: it is as fast as the SIMD instructions but uses a higher precision. And it has become a "protected area" because the OS hardly uses it any more, so you are free to do whatever you want.

SSE2 is the best choice for anything that can be parallelised. Multiply four DWORDs with one instruction - that you cannot beat with 4 FPU instructions.
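A minimal sketch of the packed case, not taken from the post itself. It assumes the "four DWORDs" are four REAL4 values, which mulps multiplies in one instruction. The include path, the names vecA, vecB and vecR, and the test values are assumptions for illustration only; movups is used so that no alignment is required.

```masm
; Sketch only: assumes the MASM32 SDK is installed in \masm32
include \masm32\include\masm32rt.inc
.686p
.xmm

.data
vecA REAL4 1.0, 2.0, 3.0, 4.0        ; hypothetical input vectors
vecB REAL4 10.0, 20.0, 30.0, 40.0

.data?
vecR REAL4 4 dup(?)                  ; receives the four products

.code
start:
    movups xmm0, OWORD PTR vecA      ; load four REAL4 values (no alignment needed)
    movups xmm1, OWORD PTR vecB
    mulps  xmm0, xmm1                ; four single-precision multiplies at once
    movups OWORD PTR vecR, xmm0      ; vecR = 10.0, 40.0, 90.0, 160.0
    invoke ExitProcess, 0
end start
```

The FPU equivalent would need four fld/fmul/fstp sequences for the same work.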

SSE3 and higher offer some more exotic instructions; you will rarely need them, and you limit your code to newer CPUs. Not a big problem nowadays, any machine that is younger than 5 or 10 years has them, but it is a point to remember.

To give you an indication: MasmBasic uses roughly the same amount of FPU and SSE2 code.

One important point: If you are calling Windows APIs, check if your FPU or SIMD registers are still intact. MasmBasic saves the xmm regs because I discovered that after WinXP, the OS started to trash xmm0 ... xmm3. See the thread "Windows trashes xmm regs but not the FPU".

HSE · May 09, 2018, 11:31:46 PM

Quote from: LordAdef on May 09, 2018, 03:23:54 PM
Question #1:
Since I can really learn this stuff if I output results, what's the way to do it. A simple and straightforward way to do it?

You can see macros.asm

Code Select

print real4$(val3)

raymond · May 10, 2018, 02:30:11 AM

Quote
Question #1:
Since I can really learn this stuff if I output results, what's the way to do it. A simple and straightforward way to do it?

Another option is to use the FpuFLtoA function in the FPU library generally provided with the MASM32 SDK. You can also download that library from the same site as the FPU tutorial, i.e. http://www.ray.masmcode.com/fpu.html#fpulib. The resulting string, which you can pre-format to some extent, gets transferred to a memory location of your choice, from where you can display it wherever/whenever/however you may wish.

daydreamer · May 10, 2018, 04:09:30 AM

Quote from: LordAdef on May 09, 2018, 03:23:54 PM
Question #1:
Since I can really learn this stuff if I output results, what's the way to do it. A simple and straightforward way to do it?

Question #2:
I've already studied Raymond's FPU great tutorial and am now studying SSE

When is one suitable over the other? I saw Marinus sometimes opted for FPU and sometimes he prefer SSE.

edit to add a Question #3:
In terms of efficiency, is it better to do any pre integer computing in register and only then take the values to SSE with cvtsi2ss (well, I learnt this one today, and it works :lol: ) or do the whole thing in SSE?

What floats your boat?
#1: Why don't you search the forum and the old forum to see all the old posts by students doing exercises on displaying integers and floats with the help of print?
#2: What floats your boat, and what kind of asm program do you want to make as an exercise while learning one of them?
An advanced math calculator, circle drawing, hyperbola, etc.? Then use the FPU and the FPU library, or check Raymond's fixed point library to learn how things were done before the FPU, on crappy few-MHz computers.
Fractals made with FPU code are in Ron Thomas's ebook; even if it's DOS, the FPU works the same with REAL4, REAL8 and REAL10.
For image processing, SSE2/MMX is great: it handles the 4 ARGB channels simultaneously, with built-in saturation to the 0-255 range. For sound processing there are versions that handle signed values instead. You can choose between packed instructions for 8-bit, 16-bit and 32-bit elements.
Randy Hyde has a good chapter on MMX (about the same as the SSE2 integer instructions, which use the 128-bit xmm regs instead) in his assembly book.
If you want the same quality as a pixel shader, use 4 floats for ARGB processing with the SSE/SSE2 floating point instructions instead. A 2D light effect is easily made with SQRTPS and RSQRTPS, and after a final float-to-integer conversion you have a bitmap.
Or use SSE floats with a math library, or without one if you like to write your own sine/cosine or other math functions, matrices.

#3: Check Raymond's fixed point library. If you learn how it works, you can either use the usual registers or learn to do the same with the SSE2 integer instructions.

Do you want to try writing code that works this way instead:
it takes 4 conditional checks and creates 4 masks, each either 0 or FFFFFFFFh (all bits set), which you can use to mask out mathematical operations,

instead of 4 ifs with conditional jump code?
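A minimal sketch of that mask idea, not code from this thread: cmpps with predicate 1 (less-than) builds four per-lane masks, and andps applies them, so four "if value > 0" tests run without a single jump. The include path, the name vals and the test data are assumptions for illustration.

```masm
; Sketch only: assumes the MASM32 SDK is installed in \masm32
include \masm32\include\masm32rt.inc
.686p
.xmm

.data
vals REAL4 -1.5, 2.0, -3.0, 4.5      ; hypothetical test data

.code
start:
    movups xmm0, OWORD PTR vals      ; xmm0 = four REAL4 values
    xorps  xmm1, xmm1                ; xmm1 = 0.0 in all four lanes
    cmpps  xmm1, xmm0, 1             ; predicate 1 = LT: FFFFFFFFh where 0.0 < value, else 0
    andps  xmm0, xmm1                ; keep the positive values, zero the others
    movups OWORD PTR vals, xmm0      ; vals = 0.0, 2.0, 0.0, 4.5
    invoke ExitProcess, 0
end start
```

The same pattern works with other predicates (0 = equal, 2 = less-or-equal) and with andnps/orps to select between two results.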

@JJ, why don't you make two macros that push and pop only those regs the OS trashes, instead of FXSAVE and FXRSTOR?
That would be useful for all of us who want to write SSE code without wasting energy and time on strange bugs caused by the OS trashing registers.

Lonewolff · May 10, 2018, 07:21:03 AM

Quote from: tenkey on May 09, 2018, 03:50:56 PM
The C/C++ compiler converts floats (REAL4) to doubles (REAL8)

Wut?? :icon_confused:

jj2007 · May 10, 2018, 07:38:26 AM

Quote from: daydreamer on May 10, 2018, 04:09:30 AM
@JJ, why don't you make two macros that push and pop only those regs the OS trashes, instead of FXSAVE and FXRSTOR?

According to Agner Fog, the OS trashes all 8 xmm regs in 32-bit code. Speedwise it is still better to save them by hand:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

13275   cycles for 100 * fxsave/fxrstor
676     cycles for 100 * save xmm0 .. 7

13297   cycles for 100 * fxsave/fxrstor
680     cycles for 100 * save xmm0 .. 7

But:

Code Select

14      bytes for fxsave/fxrstor
112     bytes for save xmm0 .. 7

And speed doesn't play any role when calling the Windows API. Guess what happens if you squeeze two little WinAPI calls into this loop:

Code Select

  .Repeat
	ct=0
	REPEAT 8
		movaps MbXs[ct*OWORD], @CatStr(<xmm>, %ct)
		ct=ct+1
	ENDM
	if 0
		sub esp, RECT
		invoke GetWindowRect, rv(GetConsoleWindow), esp
		add esp, RECT
	endif
	ct=0
	REPEAT 8
		movaps @CatStr(<xmm>, %ct), MbXs[ct*OWORD]
		ct=ct+1
	ENDM
	dec ebx
  .Until Sign?

Code Select

13      kCycles for 100 * fxsave/fxrstor
0       kCycles for 100 * save xmm0 .. 7
1700    kCycles for 100 * save xmm0 .. 7, plus a WinAPI call

With if 0, 0 kCycles
With if 1, 1700 kCycles, a factor 130 slower than the "slow" fxsave/fxrstor sequence.

LordAdef · May 10, 2018, 07:59:25 AM

If needed, where is the best place to save them? Is it possible to save them to the FPU? Or is it better to use variables?

daydreamer · May 10, 2018, 08:27:57 AM

Quote from: LordAdef on May 10, 2018, 07:59:25 AM
If needed, where is the best place to save them? Is it possible to save them to the FPU? Or is it better to use variables?

The FPU regs are too small to save all the xmm regs.
Check this thread, its macros, and its links to other masm threads:
http://masm32.com/board/index.php?topic=7123.0
Maybe you prefer dynamically allocated memory over the .data? section; it works as long as you use align 16, or an indirect register aligned on a 16-byte boundary.
Check Hutch's 64-bit solution:
http://masm32.com/board/index.php?topic=7121.msg76859#new

jj2007 · May 10, 2018, 10:16:26 AM

Quote
Maybe you prefer dynamically allocated memory over the .data? section; it works as long as you use align 16, or an indirect register aligned on a 16-byte boundary.

For such small amounts, the .data? section is ok (but not thread-safe); a local variable will also do the job. If you want to preserve only a few xmm regs, don't worry about align 16, just use movups instead of movaps. The performance penalty is negligible.
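A minimal sketch of the local-variable variant described above, not code from MasmBasic. The procedure name GetTicksKeepXmm, the buffer name xmmSave and the choice of GetTickCount as the API call are assumptions for illustration; the include path assumes a standard MASM32 install.

```masm
; Sketch only: assumes the MASM32 SDK is installed in \masm32
include \masm32\include\masm32rt.inc
.686p
.xmm

.code
; Hypothetical wrapper: preserves xmm0 and xmm1 across a Windows API call
GetTicksKeepXmm proc
    LOCAL xmmSave[2]:OWORD           ; 32 bytes on the stack, alignment not guaranteed
    movups xmmSave[0], xmm0          ; movups tolerates an unaligned address
    movups xmmSave[16], xmm1         ; second slot, 16 bytes further on
    invoke GetTickCount              ; any API call that may trash xmm regs
    movups xmm0, xmmSave[0]          ; restore both registers
    movups xmm1, xmmSave[16]
    ret                              ; eax still holds the GetTickCount result
GetTicksKeepXmm endp

start:
    call GetTicksKeepXmm
    invoke ExitProcess, 0
end start
```

Because the buffer is a LOCAL, each thread gets its own copy, which avoids the thread-safety issue of a shared .data? buffer.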

LordAdef · May 10, 2018, 10:38:36 AM

Quote from: jj2007 on May 10, 2018, 10:16:26 AM
Quote
Maybe you prefer dynamically allocated memory over the .data? section; it works as long as you use align 16, or an indirect register aligned on a 16-byte boundary.

For such small amounts, the .data? section is ok (but not thread-safe); a local variable will also do the job. If you want to preserve only a few xmm regs, don't worry about align 16, just use movups instead of movaps. The performance penalty is negligible.

Code Select

FadeInTerrain proc tTrn, tLines
       ; blendStep = 255/lineCount * mapspace
    LOCAL tBlnd:REAL4

    mov      tBlnd, 255.0
    movss     xmm0, tBlnd
    cvtsi2ss xmm1, tLines
    divss     xmm0, xmm1
    movss     fadeIncrement, xmm0                

    ; push ecx
    ; con "fadeStepVal is %s", real4$(fadeIncrement)
    ; pop ecx

    mov edi, tTrn
    and currBlendFloat, 0                     ; reset blend float value (0.0 is all zero bits)
    mov trn.blend[edi], 0                    ; blend is 0
    mov trn.isFade[edi], 1                    ; 1== fadein
    ret
FadeInTerrain endp

This code is working. You mentioned movaps/movups, but I'm using movss. Any thoughts?

Concerning the saving place for the floats, how about a struct of floats as container?

jj2007 · May 10, 2018, 11:02:33 AM

Quote from: LordAdef on May 10, 2018, 10:38:36 AM
You mentioned movaps/movups, but I'm using movss. Any thoughts?

movss is perfect for singles, and movd does the same. The movaps and movups instructions deal with the full OWORDs.

Quote
Concerning the saving place for the floats, how about a struct of floats as container?

Good idea :t

Code Select

FadeInTerrain proc tTrn, tLines
  ; blendStep = 255/lineCount * mapspace
if 0
	LOCAL tBlnd:REAL4
	mov      tBlnd, 255.0
	movss     xmm0, tBlnd
else
	.data
	tBlnd REAL4 255.0	; 5 bytes shorter
	.code
	movss     xmm0, tBlnd
endif

tenkey · May 11, 2018, 09:02:34 AM

Quote from: Ascended on May 10, 2018, 07:21:03 AM
Quote from: tenkey on May 09, 2018, 03:50:56 PM
The C/C++ compiler converts floats (REAL4) to doubles (REAL8)

Wut?? :icon_confused:

The printf function is built to work seamlessly with C code. That means printf's floating point arguments must be REAL8 when you are calling directly from ASM.
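A minimal sketch of what that means in practice, assuming the MASM32 SDK (where the C runtime's printf is exposed as crt_printf) and a console build. The names val4 and fmt and the test value are hypothetical. The REAL4 is widened by loading it into the FPU and storing it back as REAL8 directly into the argument slot on the stack.

```masm
; Sketch only: assumes the MASM32 SDK is installed in \masm32
include \masm32\include\masm32rt.inc

.data
val4 REAL4 3.25                      ; hypothetical test value
fmt  db "result: %f", 13, 10, 0

.code
start:
    sub   esp, 8                     ; reserve 8 bytes on the stack for a REAL8
    fld   val4                       ; load the REAL4 into ST(0)
    fstp  REAL8 PTR [esp]            ; store it as REAL8 into the argument slot
    push  offset fmt                 ; format string is the first argument
    call  crt_printf                 ; cdecl: the caller cleans up the stack
    add   esp, 12                    ; 4 (pointer) + 8 (REAL8)
    invoke ExitProcess, 0
end start
```

Pushing the REAL4 itself would hand printf only 4 of the 8 bytes it reads for %f, which would explain the 0.0000 output from the first post.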
