News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Basic FPU questions

Started by NoCforMe, January 25, 2022, 03:23:41 PM

Previous topic - Next topic

NoCforMe

Here's a bit of code I just wrote. It's a random-number generator, simple but a pretty good one. (I borrowed it from docs.microsoft.com/en-us/archive/msdn-magazine/2016/august/test-run-lightweight-random-number-generation on MSDN; it's the Lehmer algorithm from that article.)


;============================================
; LehmerRand - Random Number Generator
;
; Contains following routines:
;
; LehmerRandInit()
; Initializes RNG. Not necessary to call it, as it's
; automagically called the first time you call LehmerRand().
;
; LehmerRand (range)
; Returns a random # in the range [0 ... range-1].
;============================================


include \masm32\include\masm32rt.inc


;============================================
; Defines, macros, prototypes, etc.
;============================================


;===== RNG values: =====
$a EQU 16807
$m EQU 2147483647
$q EQU 127773
$r EQU 2836


;============================================
; HERE BE DATA
;============================================
.data

LehmerInitFlag DB FALSE


;============================================
; UNINITIALIZED DATA
;============================================
.data?

QPCvalue LABEL DWORD
RandomSeed LABEL DWORD ;Same as low part of counter.
  QPClow DD ?
  QPChigh DD ?

X87CW DW ?
OldX87CW DW ?


;============================================
; CODE LIVES HERE
;============================================
.code


;====================================================================
; LehmerRandInit()
;
; Sets up RNG for use by getting initial seed from hi-res timer.
;====================================================================

LehmerRandInit PROC

; Seed the RNG from a time value:
INVOKE QueryPerformanceCounter, OFFSET QPCvalue
AND QPClow, 7FFFFFFFH ;Make sure seed doesn't overflow signed INT.
MOV LehmerInitFlag, TRUE
RET

LehmerRandInit ENDP


;====================================================================
; LehmerRand (range)
;
; Delivers a random number, based on the "seed".
; Automatically initializes the seed the first time it's called
; (user can also re-initialize by calling LehmerRandInit(), above)
;
; This code is derived from the MSDN article "Lightweight Random
; Number Generation"
; (https://docs.microsoft.com/en-us/archive/msdn-magazine/2016/august/test-run-lightweight-random-number-generation)
; This is the Lehmer algorithm flavor.
;
; range:
; Upper limit of random #s desired (#s will be between 0 and [range-1]).
;
; Returns:
; EAX = random number
;====================================================================

LehmerRand PROC range:DWORD
LOCAL hi:DWORD, lo:DWORD, temp:DWORD

; Automagically initialize ourselves the 1st time through:
CMP LehmerInitFlag, TRUE
JE @F
CALL LehmerRandInit

@@: MOV EAX, RandomSeed
XOR EDX, EDX
MOV ECX, $q
DIV ECX
MOV hi, EAX ;hi = seed / q.
MOV lo, EDX ;lo = seed % q.

; seed = a*lo - r*hi:
MOV EAX, $a
MUL lo
MOV ECX, EAX ;Stash temporarily.
MOV EAX, $r
MUL hi
SUB ECX, EAX
JNC @F ;Is < 0?
ADD ECX, $m ;  Yep, so correct.
@@: MOV RandomSeed, ECX ;New seed for next time.

FINIT ;Reset FPU (to clear stack regs.)

; Set rounding control field to "round down":
FSTCW X87CW
MOV AX, X87CW
MOV OldX87CW, AX ;Save current CW.
AND X87CW, 0F3FFH ;Clear RC field.
OR X87CW, 400H ;Set RC field = 01.
FLDCW X87CW

; Rand = seed / m:
FILD RandomSeed
MOV temp, $m
FILD temp
FDIV

; Scale random # to range:
FILD range
FMUL
FISTP temp
MOV EAX, temp

FLDCW OldX87CW ;Restore original CW.
RET

LehmerRand ENDP


END



It makes some use of the FPU. My questions:

1. When I first tested this it would run fine 7 times but would fail on the 8th time, with an error from the FPU. Turned out I was using up all of the "stack" registers, which is why there's a FINIT in there now, to clear them. Is this the best way to do that? the only way? how else can you clear FPU registers? just store something (FSTP/FISTP) into a bit bucket? I thought I saw some instructions, xxxPP that did a double pop; could one use these? And is FINIT an expensive instruction, time-wise? Seems kind of drastic to do a total reset on the poor FPU every time through this routine.

2. Regarding FPU "etiquette": I needed to change the rounding mode here to "round down" so that the result would always be less than the "range" variable. That works fine. I took the trouble to save the existing code word and restore it afterwards. Is this necessary? It seems like the FPU is a resource available to any process that wants to use it, so I get the feeling that it's not that important to carefully save its complete state and then restore it when you're done. I mean, it's not like having to save and restore those "sacred" X86 registers (EBX, ESI, EDI, EBP), is it? Isn't each process responsible for setting up the FPU to its liking, with no assumptions about what state it's in?

I'm hoping Ray Filiatreault might drop by here, seeing as he's the resident FPU guru here ...

(BTW, regarding the Lehmer RNG, it's pretty good, but as the author points out it's not statistically rigorous and isn't suitable for applications like cryptographic security.)
Assembly language programming should be fun. That's why I do it.

daydreamer

Balanced Fld/fstp and use some real4 or real8 or REAL10 variables
Here rounding mode is if you scroll down
http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

NoCforMe

"Balanced Fld/fstp"; so if I have more loads than stores, I need to do some fake stores (and pops) like I suggested?

OK, I tried that:

; FINIT ;Reset FPU (to clear stack regs.)

; Set rounding control field to "round down":
FSTCW X87CW
MOV AX, X87CW
MOV OldX87CW, AX ;Save current CW.
AND X87CW, 0F3FFH ;Clear RC field.
OR X87CW, 400H ;Set RC field = 01.
FLDCW X87CW

; Rand = seed / m:
FILD RandomSeed
MOV temp, $m
FILD temp
FDIV

; Scale random # to range:
FILD range
FMUL
FISTP temp
MOV EAX, temp

FISTP temp ;Clear stack.
FISTP temp

FLDCW OldX87CW ;Restore original CW.


And it works! Notice I commented out the FINIT.
Assembly language programming should be fun. That's why I do it.

jj2007

Quote from: NoCforMe on January 25, 2022, 08:55:47 PM
"Balanced Fld/fstp"; so if I have more loads than stores, I need to do some fake stores (and pops) like I suggested?

In general, you will never need fake stores and pops, if the code is properly designed. You should absolutely read Raymond's explanation of the revolver barrel.

I tried your original code, and it works just fine, see attached usage example. Don't expect speed records, though - I've marked the slow instructions in the source. Results for a 100 Million numbers on my trusty old Core i5:

5585 ticks for LehmerRand
343 ticks for Rand()

NoCforMe

Quote from: jj2007 on January 25, 2022, 09:41:28 PM

In general, you will never need fake stores and pops, if the code is properly designed. You should absolutely read Raymond's explanation of the revolver barrel.

I don't see how my code here could achieve this without using "fake" stores and pops. Look at what I'm doing. F'rinstance, I have to do 2 loads for a multiply and 2 for a divide, and in neither instance am I putting values back into memory. So how am I supposed to achieve this without doing at least one fake store/pop? And I do understand the "revolver barrel" (much better analogy than a "stack", btw), which is how I discovered my error in the first place.

QuoteI tried your original code, and it works just fine, see attached usage example. Don't expect speed records, though - I've marked the slow instructions in the source. Results for a 100 Million numbers on my trusty old Core i5:

5585 ticks for LehmerRand
343 ticks for Rand()

Thanks for the heads up. So DIV is the main culprit; I guess one would want to code a routine using, what? shifts and rotates instead? Plus the FSTCW so I can change the rounding mode, which really needs to be done here.

Do you publish source for your MasmBasic routines? Curious to see how your Rand() works.
Assembly language programming should be fun. That's why I do it.

raymond

Random numbers

The fastest algo will depend primarily on what you intend to do with those numbers. If, for example, you only want to shuffle a deck of cards, you can design an effective one requiring only a few clock ticks. Always keep the ultimate purpose in mind.

FPU "etiquette"

It does not exist, since even the Windows OS considers the FPU registers free for all. I have reported that ever since WinXP when I realized that MS programmers had fallen in love with using the FPU registers in most of their functions dealing with the screen through the use of MMX instructions. This has necessarily not subsided with the expansion of monitor sizes and ratios.

As for the "sacred" X86 registers, they are not sacred anymore since post WinXP. I remember writing a program which worked fine on the next Windows version but crashed when trying it under WinXP; I had inadvertently mixed the popping order of one esi/edi register pair. The MS programmers had finally realized that it had become too risky to rely on other users for their preservation. Consider them "sacred" within your program if there is a necessity for it, or if you intend to export part of the code for public use.

FPU register management

Whenever you use the FPU, you should always be aware of the content of each of the registers ST0 to ST7. In that way, you will know (i) how to address each one of them properly and (ii) also make sure that ST7 is free if you need to load some other value onto the FPU. And, a standard way to get rid of the top register content and bring everything back up the chain is to use the fstp st instruction. (You can also look up the ffree instruction.)
Whenever you assume something, you risk being wrong half the time.
https://masm32.com/masmcode/rayfil/index.html

NoCforMe

So since the FPU is now fair game for anyone to play with, maybe in my code I should just go ahead and set the code word to my liking and not bother to store the old one first, as JJ advises that FSTCW is a slow instruction.

Regarding clearing registers, is FSTP ST faster than my FIST <mem32>, since it references no memory? what about FFREE instead?
Assembly language programming should be fun. That's why I do it.

raymond

Quoteis FSTP ST faster than my FIST <mem32>, since it references no memory?

Not only it references no memory, but it doesn't even waste time to convert the floating number to an integer!!! It simply flushes it.

And yes, you set up the FPU control word to you needs and modify it as often as you want while you are using the FPU. Simply remember how you set it up throughout your program whenever you modify it.
Whenever you assume something, you risk being wrong half the time.
https://masm32.com/masmcode/rayfil/index.html