Basic FPU questions

NoCforMe · January 25, 2022, 03:23:41 PM

Here's a bit of code I just wrote. It's a random-number generator, simple but a pretty good one. (I borrowed it from docs.microsoft.com/en-us/archive/msdn-magazine/2016/august/test-run-lightweight-random-number-generation on MSDN; it's the Lehmer algorithm from that article.)

Code Select


;============================================
; LehmerRand - Random Number Generator
;
; Contains following routines:
;
; LehmerRandInit()
;	Initializes RNG. Not necessary to call it, as it's
;	automagically called the first time you call LehmerRand().
;
; LehmerRand (range)
;	Returns a random # in the range [0 ... range-1].
;============================================


	include \masm32\include\masm32rt.inc


;============================================
; Defines, macros, prototypes, etc.
;============================================


;===== RNG values: =====
$a			EQU 16807
$m			EQU 2147483647
$q			EQU 127773
$r			EQU 2836


;============================================
; HERE BE DATA
;============================================
.data

LehmerInitFlag		DB FALSE


;============================================
; UNINITIALIZED DATA
;============================================
.data?

QPCvalue		LABEL DWORD
RandomSeed		LABEL DWORD		;Same as low part of counter.
  QPClow		DD ?
  QPChigh		DD ?

X87CW			DW ?
OldX87CW		DW ?


;============================================
; CODE LIVES HERE
;============================================
.code


;====================================================================
; LehmerRandInit()
;
; Sets up RNG for use by getting initial seed from hi-res timer.
;====================================================================

LehmerRandInit	PROC

; Seed the RNG from a time value:
	INVOKE	QueryPerformanceCounter, OFFSET QPCvalue
	AND	QPClow, 7FFFFFFFH		;Make sure seed doesn't overflow signed INT.
	MOV	LehmerInitFlag, TRUE
	RET

LehmerRandInit	ENDP


;====================================================================
; LehmerRand (range)
;
; Delivers a random number, based on the "seed".
; Automatically initializes the seed the first time it's called
; (user can also re-initialize by calling LehmerRandInit(), above)
;
; This code is derived from the MSDN article "Lightweight Random
; Number Generation"
; (https://docs.microsoft.com/en-us/archive/msdn-magazine/2016/august/test-run-lightweight-random-number-generation)
; This is the Lehmer algorithm flavor.
;
; range:
;	Upper limit of random #s desired (#s will be between 0 and [range-1]).
;
; Returns:
;	EAX = random number
;====================================================================

LehmerRand	PROC range:DWORD
	LOCAL	hi:DWORD, lo:DWORD, temp:DWORD

; Automagically initialize ourselves the 1st time through:
	CMP	LehmerInitFlag, TRUE
	JE	@F
	CALL	LehmerRandInit

@@:	MOV	EAX, RandomSeed
	XOR	EDX, EDX
	MOV	ECX, $q
	DIV	ECX
	MOV	hi, EAX			;hi = seed / q.
	MOV	lo, EDX			;lo = seed % q.

; seed = a*lo - r*hi:
	MOV	EAX, $a
	MUL	lo
	MOV	ECX, EAX		;Stash temporarily.
	MOV	EAX, $r
	MUL	hi
	SUB	ECX, EAX
	JNC	@F			;Is < 0?
	ADD	ECX, $m			;  Yep, so correct.
@@:	MOV	RandomSeed, ECX		;New seed for next time.

	FINIT				;Reset FPU (to clear stack regs.)

; Set rounding control field to "round down":
	FSTCW	X87CW
	MOV	AX, X87CW
	MOV	OldX87CW, AX		;Save current CW.
	AND	X87CW, 0F3FFH		;Clear RC field.
	OR	X87CW, 400H		;Set RC field = 01.
	FLDCW	X87CW

; Rand = seed / m:
	FILD	RandomSeed
	MOV	temp, $m
	FILD	temp
	FDIV

; Scale random # to range:
	FILD	range
	FMUL
	FISTP	temp
	MOV	EAX, temp

	FLDCW	OldX87CW		;Restore original CW.
	RET

LehmerRand	ENDP


	END

It makes some use of the FPU. My questions:

1. When I first tested this it would run fine 7 times but would fail on the 8th time, with an error from the FPU. Turned out I was using up all of the "stack" registers, which is why there's a FINIT in there now, to clear them. Is this the best way to do that? the only way? how else can you clear FPU registers? just store something (FSTP/FISTP) into a bit bucket? I thought I saw some instructions, xxxPP that did a double pop; could one use these? And is FINIT an expensive instruction, time-wise? Seems kind of drastic to do a total reset on the poor FPU every time through this routine.

2. Regarding FPU "etiquette": I needed to change the rounding mode here to "round down" so that the result would always be less than the "range" variable. That works fine. I took the trouble to save the existing code word and restore it afterwards. Is this necessary? It seems like the FPU is a resource available to any process that wants to use it, so I get the feeling that it's not that important to carefully save its complete state and then restore it when you're done. I mean, it's not like having to save and restore those "sacred" X86 registers (EBX, ESI, EDI, EBP), is it? Isn't each process responsible for setting up the FPU to its liking, with no assumptions about what state it's in?

I'm hoping Ray Filiatreault might drop by here, seeing as he's the resident FPU guru here ...

(BTW, regarding the Lehmer RNG, it's pretty good, but as the author points out it's not statistically rigorous and isn't suitable for applications like cryptographic security.)

daydreamer · January 25, 2022, 08:43:14 PM

Balanced Fld/fstp and use some real4 or real8 or REAL10 variables
Here rounding mode is if you scroll down
http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm

NoCforMe · January 25, 2022, 08:55:47 PM

"Balanced Fld/fstp"; so if I have more loads than stores, I need to do some fake stores (and pops) like I suggested?

OK, I tried that:

Code Select


;	FINIT				;Reset FPU (to clear stack regs.)

; Set rounding control field to "round down":
	FSTCW	X87CW
	MOV	AX, X87CW
	MOV	OldX87CW, AX		;Save current CW.
	AND	X87CW, 0F3FFH		;Clear RC field.
	OR	X87CW, 400H		;Set RC field = 01.
	FLDCW	X87CW

; Rand = seed / m:
	FILD	RandomSeed
	MOV	temp, $m
	FILD	temp
	FDIV

; Scale random # to range:
	FILD	range
	FMUL
	FISTP	temp
	MOV	EAX, temp

	FISTP	temp			;Clear stack.
	FISTP	temp

	FLDCW	OldX87CW		;Restore original CW.

And it works! Notice I commented out the FINIT.

jj2007 · January 25, 2022, 09:41:28 PM

Quote from: NoCforMe on January 25, 2022, 08:55:47 PM
"Balanced Fld/fstp"; so if I have more loads than stores, I need to do some fake stores (and pops) like I suggested?

In general, you will never need fake stores and pops, if the code is properly designed. You should absolutely read Raymond's explanation of the revolver barrel.

I tried your original code, and it works just fine, see attached usage example. Don't expect speed records, though - I've marked the slow instructions in the source. Results for a 100 Million numbers on my trusty old Core i5:

5585 ticks for LehmerRand
343 ticks for Rand()

NoCforMe · January 26, 2022, 11:38:50 AM

Quote from: jj2007 on January 25, 2022, 09:41:28 PM

In general, you will never need fake stores and pops, if the code is properly designed. You should absolutely read Raymond's explanation of the revolver barrel.

I don't see how my code here could achieve this without using "fake" stores and pops. Look at what I'm doing. F'rinstance, I have to do 2 loads for a multiply and 2 for a divide, and in neither instance am I putting values back into memory. So how am I supposed to achieve this without doing at least one fake store/pop? And I do understand the "revolver barrel" (much better analogy than a "stack", btw), which is how I discovered my error in the first place.

QuoteI tried your original code, and it works just fine, see attached usage example. Don't expect speed records, though - I've marked the slow instructions in the source. Results for a 100 Million numbers on my trusty old Core i5:

5585 ticks for LehmerRand
343 ticks for Rand()

Thanks for the heads up. So DIV is the main culprit; I guess one would want to code a routine using, what? shifts and rotates instead? Plus the FSTCW so I can change the rounding mode, which really needs to be done here.

Do you publish source for your MasmBasic routines? Curious to see how your Rand() works.

raymond · January 26, 2022, 01:33:44 PM

Random numbers

The fastest algo will depend primarily on what you intend to do with those numbers. If, for example, you only want to shuffle a deck of cards, you can design an effective one requiring only a few clock ticks. Always keep the ultimate purpose in mind.

FPU "etiquette"

It does not exist, since even the Windows OS considers the FPU registers free for all. I have reported that ever since WinXP when I realized that MS programmers had fallen in love with using the FPU registers in most of their functions dealing with the screen through the use of MMX instructions. This has necessarily not subsided with the expansion of monitor sizes and ratios.

As for the "sacred" X86 registers, they are not sacred anymore since post WinXP. I remember writing a program which worked fine on the next Windows version but crashed when trying it under WinXP; I had inadvertently mixed the popping order of one esi/edi register pair. The MS programmers had finally realized that it had become too risky to rely on other users for their preservation. Consider them "sacred" within your program if there is a necessity for it, or if you intend to export part of the code for public use.

FPU register management

Whenever you use the FPU, you should always be aware of the content of each of the registers ST0 to ST7. In that way, you will know (i) how to address each one of them properly and (ii) also make sure that ST7 is free if you need to load some other value onto the FPU. And, a standard way to get rid of the top register content and bring everything back up the chain is to use the fstp st instruction. (You can also look up the ffree instruction.)

NoCforMe · January 26, 2022, 01:41:35 PM

So since the FPU is now fair game for anyone to play with, maybe in my code I should just go ahead and set the code word to my liking and not bother to store the old one first, as JJ advises that FSTCW is a slow instruction.

Regarding clearing registers, is FSTP ST faster than my FIST <mem32>, since it references no memory? what about FFREE instead?

raymond · January 26, 2022, 02:27:38 PM

Quoteis FSTP ST faster than my FIST <mem32>, since it references no memory?

Not only it references no memory, but it doesn't even waste time to convert the floating number to an integer!!! It simply flushes it.

And yes, you set up the FPU control word to you needs and modify it as often as you want while you are using the FPU. Simply remember how you set it up throughout your program whenever you modify it.

The MASM Forum

News:

Basic FPU questions

NoCforMe

daydreamer

NoCforMe

jj2007

NoCforMe

raymond

NoCforMe

raymond