The Floating point road

Started by LordAdef, May 09, 2018, 03:23:54 PM

LordAdef

Hi, after hammering away at this FPU code for hours and not getting the right result, I came to realize that printf("result: %f", real4Value) doesn't work (or am I doing something stupid as usual?)

I decided I need to study floating point (in Asm, obviously) once and for all. So I cracked open "Modern x86 Assembly Language Programming" by D. Kusswurm (I own a copy, after Hutch's suggestion; GREAT book btw).

He outputs his results through printf in C++, and I was getting the same 0.0000 results, so I am now sure my printf doesn't work.
The question is, why?

I'm using MasmBasic Print Str$ (val3)


Question #1:
Since I can only really learn this stuff if I can output results, what's a simple and straightforward way to do it?

Question #2:
I've already studied Raymond's great FPU tutorial and am now studying SSE.

When is one more suitable than the other? I saw that Marinus sometimes opts for the FPU and sometimes prefers SSE.

edit to add a Question #3:
In terms of efficiency, is it better to do any preliminary integer computation in general-purpose registers and only then move the values to SSE with cvtsi2ss (well, I learnt this one today, and it works :lol: ), or to do the whole thing in SSE?

tenkey

The C/C++ compiler converts floats (REAL4) to doubles (REAL8), so printf's %f handles only REAL8.
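
To make that concrete, here is a minimal sketch of calling printf from MASM with a REAL4 source value. It assumes a default MASM32 SDK install (masm32rt.inc, which provides the crt_printf prototype) and a console build; the names val4, val8 and fmt are made up for this example. The FPU is used only to widen the REAL4 to a REAL8 before the call.

```asm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK path

.data
val4  REAL4 1.5                        ; hypothetical single-precision value
fmt   db "result: %f", 13, 10, 0

.data?
val8  REAL8 ?                          ; printf's %f expects 8 bytes on the stack

.code
start:
    fld   val4                         ; load the REAL4 (widened inside the FPU)
    fstp  val8                         ; store it back as a REAL8
    invoke crt_printf, addr fmt, val8  ; pushes both DWORDs of the REAL8
    invoke ExitProcess, 0
end start
```

Passing val4 directly would push only 4 bytes, and printf would read 8, which is one way to end up with 0.000000 on screen.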

LordAdef

Quote from: tenkey on May 09, 2018, 03:50:56 PM
The C/C++ compiler converts floats (REAL4) to doubles (REAL8), so %f handles only REAL8.

Thank you for that, sir!

jj2007

Quote from: LordAdef on May 09, 2018, 03:23:54 PM
I'm using MasmBasic Print Str$ (val3)
And it works, I guess :P

Quote:
Question #2:
I've already studied Raymond's great FPU tutorial and am now studying SSE.

When is one more suitable than the other?
Under the hood, the cpu probably uses the same circuits. Personally, I prefer the FPU for standard tasks - it is as fast as the SIMD instructions but uses a higher precision. And it has become a "protected area" because the OS hardly uses it any more, so you are free to do whatever you want.

SSE2 is the best choice for anything that can be parallelised. Multiply four DWORDs with one instruction - that you cannot beat with 4 FPU instructions.

SSE3 and higher offer some more exotic instructions; you will rarely need them, and you limit your code to newer CPUs. Not a big problem nowadays, any machine that is younger than 5 or 10 years has them, but it is a point to remember.

To give you an indication: MasmBasic uses roughly the same amount of FPU and SSE2 code.

One important point: If you are calling Windows APIs, check if your FPU or SIMD registers are still intact. MasmBasic saves the xmmregs because I discovered that after WinXP, the OS started to trash xmm0 ... xmm3. See Windows trashes xmm regs but not the FPU
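
As a rough illustration of the "one instruction instead of four" point above, the sketch below multiplies four REAL4 values with a single mulps. It assumes the MASM32 SDK and a console build; vecA, vecB, vecR, tmp8 and fmt are hypothetical names. movups through a register pointer is used so that no 16-byte alignment is required, and the results are stored to memory before any CRT call, because the xmm registers may not survive it.

```asm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK path
.686
.xmm

.data
vecA  REAL4 1.0, 2.0, 3.0, 4.0         ; hypothetical input vectors
vecB  REAL4 10.0, 10.0, 10.0, 10.0
fmt   db "%f", 13, 10, 0

.data?
vecR  REAL4 4 dup(?)
tmp8  REAL8 ?

.code
start:
    mov    esi, offset vecA
    mov    edi, offset vecB
    movups xmm0, [esi]                 ; load four singles, unaligned is fine
    movups xmm1, [edi]
    mulps  xmm0, xmm1                  ; four multiplications in one instruction
    mov    esi, offset vecR
    movups [esi], xmm0                 ; store results before calling the CRT

    xor    ebx, ebx                    ; ebx survives the printf calls
  @@:
    fld    vecR[ebx*4]                 ; widen each REAL4 ...
    fstp   tmp8                        ; ... to the REAL8 that printf expects
    invoke crt_printf, addr fmt, tmp8
    inc    ebx
    cmp    ebx, 4
    jb     @B
    invoke ExitProcess, 0
end start
```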

HSE

Quote from: LordAdef on May 09, 2018, 03:23:54 PM
Question #1:
Since I can only really learn this stuff if I can output results, what's a simple and straightforward way to do it?

You can see macros.asm

print real4$(val3)
Equations in Assembly: SmplMath

raymond

Quote:
Question #1:
Since I can only really learn this stuff if I can output results, what's a simple and straightforward way to do it?

Another option is to use the FpuFLtoA function in the FPU library generally provided with the MASM32 SDK. You can also download that library from the same site as the FPU tutorial, i.e. http://www.ray.masmcode.com/fpu.html#fpulib. The resulting string, which you can pre-format to some extent, gets transferred to a memory location of your choice, from where you can display it wherever/whenever/however you wish.
Whenever you assume something, you risk being wrong half the time.
https://masm32.com/masmcode/rayfil/index.html

daydreamer

Quote from: LordAdef on May 09, 2018, 03:23:54 PM
Question #1:
Since I can only really learn this stuff if I can output results, what's a simple and straightforward way to do it?

Question #2:
I've already studied Raymond's great FPU tutorial and am now studying SSE.

When is one more suitable than the other? I saw that Marinus sometimes opts for the FPU and sometimes prefers SSE.

edit to add a Question #3:
In terms of efficiency, is it better to do any preliminary integer computation in general-purpose registers and only then move the values to SSE with cvtsi2ss (well, I learnt this one today, and it works :lol: ), or to do the whole thing in SSE?

#1: Why don't you search the forum and the old forum for older posts by students doing exercises on displaying integers and floats with print?
#2: What floats your boat? What kind of asm program do you want to write as an exercise while learning one of them?
An advanced math calculator, circle drawing, hyperbolas, etc.? Then use the FPU and the FPU library, or check Raymond's fixed point library to learn how things were done before the FPU, on slow computers running at a few MHz.
Fractals made with FPU code are in Ron Thomas's ebook; even though it is for DOS, the FPU works the same with REAL4, REAL8 and REAL10.
Image processing works great with SSE2/MMX, which handle the 4 ARGB channels simultaneously with built-in saturation to the 0-255 range. For sound processing there are versions of the same instructions that handle signed values instead. You can choose between packed instructions for 8-bit, 16-bit and 32-bit elements.
Randy Hyde has a good chapter on MMX (about the same as the SSE2 integer instructions, which use the 128-bit xmm regs instead) in his assembly book.
If you want the same quality as a pixel shader, use 4 floats for ARGB processing with the SSE/SSE2 floating point instructions instead. A 2D light effect is easily made with SQRTPS and RSQRTPS, and after a final float-to-integer conversion you have a bitmap.
Or use SSE floats with a math library, or without one if you like to write your own sine/cosine or other math functions, or matrices.

#3: Check Raymond's fixed point library. If you learn how it works, you can use the usual registers or learn to do the same with SSE2 integer instructions.

Do you want to try writing code that works this way instead:
it performs 4 conditional checks and creates 4 masks, each either 0 or FFFFh, that you can use to mask out mathematical operations

instead of 4 if's with conditional jumps?
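
A small sketch of that mask idea, using the SSE float compare rather than the 16-bit integer form: cmpps produces an all-ones (0FFFFFFFFh) or all-zero mask per 32-bit lane, and andps then keeps or clears each value without any jump. It assumes the MASM32 SDK and a console build; vals, limit, res, tmp8 and fmt are hypothetical names, and "keep values below 4.0, zero the rest" is just an example condition.

```asm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK path
.686
.xmm

.data
vals  REAL4 1.0, 5.0, 3.0, 8.0         ; hypothetical inputs
limit REAL4 4.0, 4.0, 4.0, 4.0         ; keep a value only if it is < 4.0
fmt   db "%f", 13, 10, 0

.data?
res   REAL4 4 dup(?)
tmp8  REAL8 ?

.code
start:
    mov    esi, offset vals
    mov    edi, offset limit
    movups xmm0, [esi]                 ; xmm0 = values
    movups xmm1, [edi]                 ; xmm1 = limits
    movaps xmm2, xmm0                  ; work on a copy of the values
    cmpps  xmm2, xmm1, 1               ; predicate 1 = less-than: lane becomes
                                       ; 0FFFFFFFFh if value < limit, else 0
    andps  xmm0, xmm2                  ; masked-out lanes become 0.0, no jumps
    mov    esi, offset res
    movups [esi], xmm0                 ; store before calling the CRT

    xor    ebx, ebx
  @@:
    fld    res[ebx*4]                  ; widen each REAL4 for printf
    fstp   tmp8
    invoke crt_printf, addr fmt, tmp8
    inc    ebx
    cmp    ebx, 4
    jb     @B
    invoke ExitProcess, 0
end start
```

With these inputs the second and fourth lanes fail the test and come out as 0.0, while the first and third pass through unchanged.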

@JJ, why don't you make two macros that push and pop only those regs the OS trashes, instead of using FXSAVE and FXRSTOR?
That would be useful for all of us who want to write SSE code without wasting time and energy on the strange bugs that OS trashing causes.




my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

Lonewolff

Quote from: tenkey on May 09, 2018, 03:50:56 PM
The C/C++ compiler converts floats (REAL4) to doubles (REAL8)

Wut??  :icon_confused:

jj2007

Quote from: daydreamer on May 10, 2018, 04:09:30 AM
@JJ, why don't you make two macros that push and pop only those regs the OS trashes, instead of using FXSAVE and FXRSTOR?

According to Agner Fog, the OS trashes all 8 xmm regs in 32-bit code. Speedwise it is still better to save them by hand:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

13275   cycles for 100 * fxsave/fxrstor
676     cycles for 100 * save xmm0 .. 7

13297   cycles for 100 * fxsave/fxrstor
680     cycles for 100 * save xmm0 .. 7


But:
14      bytes for fxsave/fxrstor
112     bytes for save xmm0 .. 7


And speed doesn't play any role when calling the Windows API. Guess what happens if you squeeze two little WinAPI calls into this loop:
  .Repeat
    ct=0
    REPEAT 8
      movaps MbXs[ct*OWORD], @CatStr(<xmm>, %ct)    ; save xmm0 .. xmm7
      ct=ct+1
    ENDM
    if 0                                             ; set to 1 to include the WinAPI calls
      sub esp, RECT
      invoke GetWindowRect, rv(GetConsoleWindow), esp
      add esp, RECT
    endif
    ct=0
    REPEAT 8
      movaps @CatStr(<xmm>, %ct), MbXs[ct*OWORD]    ; restore xmm0 .. xmm7
      ct=ct+1
    ENDM
    dec ebx
  .Until Sign?

13      kCycles for 100 * fxsave/fxrstor
0       kCycles for 100 * save xmm0 .. 7
1700    kCycles for 100 * save xmm0 .. 7, plus a WinAPI call


With if 0, 0 kCycles
With if 1, 1700 kCycles, a factor 130 slower than the "slow" fxsave/fxrstor sequence.

LordAdef

If needed, where is the best place to save them? Is it possible to save them to the FPU? Or is it better to save them to variables?

daydreamer

Quote from: LordAdef on May 10, 2018, 07:59:25 AM
If needed, where is the best place to save them? Is it possible to save them to the FPU? Or is it better to save them to variables?
The FPU regs are too small to hold all the xmm regs.
Check this thread, its macros and its links to other MASM threads:
http://masm32.com/board/index.php?topic=7123.0
You may prefer dynamically allocated memory over the .data? section; it works as long as you use align 16, or an indirect register aligned on a 16-byte boundary.
Check Hutch's 64-bit solution:
http://masm32.com/board/index.php?topic=7121.msg76859#new

jj2007

Quote:
You may prefer dynamically allocated memory over the .data? section; it works as long as you use align 16, or an indirect register aligned on a 16-byte boundary.

For such small amounts, the .data? section is ok (but not thread-safe); a local variable will also do the job. If you want to preserve only a few xmm regs, don't worry about align 16, just use movups instead of movaps. The performance penalty is negligible.
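
A minimal sketch of the "local variable plus movups" suggestion, assuming the MASM32 SDK and a console build. The procedure name CallWithXmm0Saved and the variables one, out4, out8 and fmt are invented for the example, and GetTickCount merely stands in for any WinAPI call that might clobber xmm0.

```asm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK path
.686
.xmm

.data
one   REAL4 1.0
fmt   db "xmm0 low lane after the call: %f", 13, 10, 0

.data?
out4  REAL4 ?
out8  REAL8 ?

.code
CallWithXmm0Saved proc                 ; hypothetical helper
    LOCAL xsave[16]:BYTE               ; stack buffer, not necessarily 16-aligned
    lea    eax, xsave
    movups [eax], xmm0                 ; unaligned store, so no align 16 needed
    invoke GetTickCount                ; any API call; may trash xmm regs
    lea    edx, xsave
    movups xmm0, [edx]                 ; restore xmm0
    ret
CallWithXmm0Saved endp

start:
    movss  xmm0, one                   ; put a known value into xmm0
    call   CallWithXmm0Saved
    movss  out4, xmm0                  ; copy it out before calling the CRT
    fld    out4
    fstp   out8                        ; widen to REAL8 for printf
    invoke crt_printf, addr fmt, out8
    invoke ExitProcess, 0
end start
```

Because the buffer is a LOCAL, each thread gets its own copy on its own stack, which sidesteps the thread-safety caveat of a shared .data? buffer.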

LordAdef

Quote from: jj2007 on May 10, 2018, 10:16:26 AM
Quote:
You may prefer dynamically allocated memory over the .data? section; it works as long as you use align 16, or an indirect register aligned on a 16-byte boundary.

For such small amounts, the .data? section is ok (but not thread-safe); a local variable will also do the job. If you want to preserve only a few xmm regs, don't worry about align 16, just use movups instead of movaps. The performance penalty is negligible.

FadeInTerrain proc tTrn, tLines
       ; blendStep = 255/lineCount * mapspace
    LOCAL tBlnd:REAL4

    mov      tBlnd, 255.0
    movss     xmm0, tBlnd
    cvtsi2ss xmm1, tLines
    divss     xmm0, xmm1
    movss     fadeIncrement, xmm0               

    ; push ecx
    ; con "fadeStepVal is %s", real4$(fadeIncrement)
    ; pop ecx

    mov edi, tTrn
    and currBlendFloat, 0.0                   ; reset blend float value
    mov trn.blend[edi], 0                    ; blend is 0
    mov trn.isFade[edi], 1                    ; 1== fadein
    ret
FadeInTerrain endp


This code is working. You mentioned movaps/movups, but I'm using movss. Any thoughts?

Concerning the saving place for the floats, how about a struct of floats as container?

jj2007

Quote from: LordAdef on May 10, 2018, 10:38:36 AM
You mentioned movaps/movups, but I'm using movss. Any thoughts?
movss is perfect for singles, and movd does the same. The movaps and movups instructions deal with the full OWORDs.

Quote:
Concerning the saving place for the floats, how about a struct of floats as container?
Good idea :t

FadeInTerrain proc tTrn, tLines
  ; blendStep = 255/lineCount * mapspace
if 0
LOCAL tBlnd:REAL4
mov      tBlnd, 255.0
movss     xmm0, tBlnd
else
.data
tBlnd REAL4 255.0 ; 5 bytes shorter
.code
movss     xmm0, tBlnd
endif

tenkey

Quote from: Ascended on May 10, 2018, 07:21:03 AM
Quote from: tenkey on May 09, 2018, 03:50:56 PM
The C/C++ compiler converts floats (REAL4) to doubles (REAL8)

Wut??  :icon_confused:
The printf function is built to work seamlessly with C code. That means printf's floating point arguments must be REAL8 when you are calling directly from ASM.