FPU timings needed

jj2007 · December 29, 2014, 09:58:33 AM

Hi, I am testing various FPU instructions and would like to get some timings, especially from AMD.

A ("both" in the results: ftst plus fnstsw)
	fldpi
	fistp stack
	ftst
	fnstsw ax
B ("fstsw nowait")
	fldpi
	fistp stack
	; ftst
	fnstsw ax
C ("fstsw wait")
	fldpi
	fistp stack
	; ftst
	fstsw ax
D ("fxam+fstsw ax")
	fxam
	fstsw ax

Results:

	Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
	8143    cycles for 1000 * both
	17701   cycles for 1000 * fstsw nowait
	17722   cycles for 1000 * fstsw wait
	4333    cycles for 1000 * fxam+fstsw ax
	
	8144    cycles for 1000 * both
	17722   cycles for 1000 * fstsw nowait
	17740   cycles for 1000 * fstsw wait
	4333    cycles for 1000 * fxam+fstsw ax
	
	8144    cycles for 1000 * both
	17697   cycles for 1000 * fstsw nowait
	17718   cycles for 1000 * fstsw wait
	4333    cycles for 1000 * fxam+fstsw ax
	
	8143    cycles for 1000 * both
	17743   cycles for 1000 * fstsw nowait
	17754   cycles for 1000 * fstsw wait
	4349    cycles for 1000 * fxam+fstsw ax
	
	8145    cycles for 1000 * both
	17700   cycles for 1000 * fstsw nowait
	17726   cycles for 1000 * fstsw wait
	4335    cycles for 1000 * fxam+fstsw ax
	
	13      bytes for both
	11      bytes for fstsw nowait
	12      bytes for fstsw wait
	8       bytes for fxam+fstsw ax
	
	14368   = eax both
	14368   = eax fstsw nowait
	14368   = eax fstsw wait
	15392   = eax fxam+fstsw ax

	Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
	2552    cycles for 1000 * both
	2422    cycles for 1000 * fstsw nowait
	2444    cycles for 1000 * fstsw wait
	1960    cycles for 1000 * fxam+fstsw ax
	
	2554    cycles for 1000 * both
	2432    cycles for 1000 * fstsw nowait
	2434    cycles for 1000 * fstsw wait
	1969    cycles for 1000 * fxam+fstsw ax
	
	2556    cycles for 1000 * both
	2388    cycles for 1000 * fstsw nowait
	2428    cycles for 1000 * fstsw wait
	1964    cycles for 1000 * fxam+fstsw ax
	
	2566    cycles for 1000 * both
	2416    cycles for 1000 * fstsw nowait
	2446    cycles for 1000 * fstsw wait
	1953    cycles for 1000 * fxam+fstsw ax
	
	8627    cycles for 1000 * both
	8281    cycles for 1000 * fstsw nowait
	9455    cycles for 1000 * fstsw wait
	3785    cycles for 1000 * fxam+fstsw ax
	
	13      bytes for both
	11      bytes for fstsw nowait
	12      bytes for fstsw wait
	8       bytes for fxam+fstsw ax
	
	14368   = eax both
	14368   = eax fstsw nowait
	14368   = eax fstsw wait
	15392   = eax fxam+fstsw ax
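
A note on the "= eax" lines: they are the FPU status word as returned by fnstsw/fstsw ax, printed in decimal. The Python sketch below is not part of the test program, and the name decode_status_word is made up for illustration. It splits the two values into fields, assuming the standard x87 layout (exception flags in bits 0-5, C0/C1/C2 in bits 8-10, TOP in bits 11-13, C3 in bit 14).

```python
# Exception flags of the x87 status word, bits 0..5 (standard layout).
EXCEPTION_FLAGS = ("IE", "DE", "ZE", "OE", "UE", "PE")

def decode_status_word(sw):
    """Split a 16-bit x87 status word into its named fields."""
    return {
        "exceptions": [name for bit, name in enumerate(EXCEPTION_FLAGS)
                       if (sw >> bit) & 1],
        "C0": (sw >> 8) & 1,
        "C1": (sw >> 9) & 1,
        "C2": (sw >> 10) & 1,
        "TOP": (sw >> 11) & 7,
        "C3": (sw >> 14) & 1,
    }

# The two values reported in the timing output.
for value in (14368, 15392):
    print(hex(value), decode_status_word(value))
```

Read this way, both values show TOP = 7 and the sticky precision flag PE set, and they differ only in C2. For fxam, C3/C2/C0 = 0/1/0 is the class code for a normal finite number; for ftst, all three bits clear means ST(0) is greater than zero.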

jimg · December 29, 2014, 10:13:31 AM

AMD Phenom(tm) II X6 1045T Processor (SSE3)
+++++++13 of 20 tests valid, loop overhead is approx. 3584/1000 cycles

6264 cycles for 1000 * both
17819 cycles for 1000 * fstsw nowait
8535 cycles for 1000 * fstsw wait
8536 cycles for 1000 * fxam+fstsw ax

?? cycles for 1000 * both
17850 cycles for 1000 * fstsw nowait
8543 cycles for 1000 * fstsw wait
8506 cycles for 1000 * fxam+fstsw ax

?? cycles for 1000 * both
8294 cycles for 1000 * fstsw nowait
18049 cycles for 1000 * fstsw wait
8531 cycles for 1000 * fxam+fstsw ax

?? cycles for 1000 * both
8322 cycles for 1000 * fstsw nowait
8514 cycles for 1000 * fstsw wait
18049 cycles for 1000 * fxam+fstsw ax

?? cycles for 1000 * both
8317 cycles for 1000 * fstsw nowait
8527 cycles for 1000 * fstsw wait
8536 cycles for 1000 * fxam+fstsw ax

13 bytes for both
11 bytes for fstsw nowait
12 bytes for fstsw wait
8 bytes for fxam+fstsw ax

14368 = eax both
14368 = eax fstsw nowait
14368 = eax fstsw wait
15392 = eax fxam+fstsw ax

--- ok ---

sinsi · December 29, 2014, 11:20:51 AM

AMD A10-7850K APU with Radeon(TM) R7 Graphics (SSE4)

3080 cycles for 1000 * both
21165 cycles for 1000 * fstsw nowait
21270 cycles for 1000 * fstsw wait
20876 cycles for 1000 * fxam+fstsw ax

3087 cycles for 1000 * both
20911 cycles for 1000 * fstsw nowait
20903 cycles for 1000 * fstsw wait
21190 cycles for 1000 * fxam+fstsw ax

3073 cycles for 1000 * both
21004 cycles for 1000 * fstsw nowait
21062 cycles for 1000 * fstsw wait
20859 cycles for 1000 * fxam+fstsw ax

3037 cycles for 1000 * both
20879 cycles for 1000 * fstsw nowait
21114 cycles for 1000 * fstsw wait
20816 cycles for 1000 * fxam+fstsw ax

3079 cycles for 1000 * both
20926 cycles for 1000 * fstsw nowait
20899 cycles for 1000 * fstsw wait
20865 cycles for 1000 * fxam+fstsw ax

13 bytes for both
11 bytes for fstsw nowait
12 bytes for fstsw wait
8 bytes for fxam+fstsw ax

14368 = eax both
14368 = eax fstsw nowait
14368 = eax fstsw wait
15392 = eax fxam+fstsw ax

jj2007 · December 29, 2014, 12:01:57 PM

Thanks a lot, jimg and sinsi :t

(I needed this for a small improvement to the binary-to-ASCII conversion)

MichaelW · December 29, 2014, 07:45:21 PM

This is on my el cheapo Asus laptop:

Intel(R) Celeron(R) CPU  N2830  @ 2.16GHz (SSE4)

18002   cycles for 1000 * both
18077   cycles for 1000 * fstsw nowait
18163   cycles for 1000 * fstsw wait
10861   cycles for 1000 * fxam+fstsw ax

18095   cycles for 1000 * both
18091   cycles for 1000 * fstsw nowait
18115   cycles for 1000 * fstsw wait
10892   cycles for 1000 * fxam+fstsw ax

18146   cycles for 1000 * both
18308   cycles for 1000 * fstsw nowait
18106   cycles for 1000 * fstsw wait
10925   cycles for 1000 * fxam+fstsw ax

18105   cycles for 1000 * both
18132   cycles for 1000 * fstsw nowait
18138   cycles for 1000 * fstsw wait
10903   cycles for 1000 * fxam+fstsw ax

18056   cycles for 1000 * both
18130   cycles for 1000 * fstsw nowait
18114   cycles for 1000 * fstsw wait
10925   cycles for 1000 * fxam+fstsw ax

13      bytes for both
11      bytes for fstsw nowait
12      bytes for fstsw wait
8       bytes for fxam+fstsw ax

14368   = eax both
14368   = eax fstsw nowait
14368   = eax fstsw wait
15392   = eax fxam+fstsw ax

--- ok ---

dedndave · December 30, 2014, 12:36:44 AM

prescott...

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

8340    cycles for 1000 * both
8735    cycles for 1000 * fstsw nowait
8646    cycles for 1000 * fstsw wait
6103    cycles for 1000 * fxam+fstsw ax

8304    cycles for 1000 * both
8640    cycles for 1000 * fstsw nowait
8962    cycles for 1000 * fstsw wait
6161    cycles for 1000 * fxam+fstsw ax

8333    cycles for 1000 * both
8646    cycles for 1000 * fstsw nowait
8704    cycles for 1000 * fstsw wait
6105    cycles for 1000 * fxam+fstsw ax

8332    cycles for 1000 * both
8647    cycles for 1000 * fstsw nowait
8954    cycles for 1000 * fstsw wait
6098    cycles for 1000 * fxam+fstsw ax

8800    cycles for 1000 * both
8643    cycles for 1000 * fstsw nowait
8710    cycles for 1000 * fstsw wait
6105    cycles for 1000 * fxam+fstsw ax

13      bytes for both
11      bytes for fstsw nowait
12      bytes for fstsw wait
8       bytes for fxam+fstsw ax

14368   = eax both
14368   = eax fstsw nowait
14368   = eax fstsw wait
15392   = eax fxam+fstsw ax

jj2007 · December 30, 2014, 04:07:49 AM

Thanks a lot :t

For the curious, here is the application. When displaying a float with maximum precision, Str$() uses an intermediate QWORD integer, which gets converted with a variant of drizz' U64ToStr algo. That works fine for numbers up to 9223372036854775807, which means you can get 19 digits of precision for 92.23% of all numbers. But I had to catch the remaining 7.77% somehow, and that's where fnstsw ax has become helpful: it catches the invalid operation raised when trying to fistp a number above the magic value.

	fld st ; create a copy for testing overflow
	fistp f2sTmp64 ; qword to mem - will choke for numbers higher than 922...
	fnstsw ax ; copy the FPU status word to ax
	test al, 1 ; bit 0 = invalid operation flag
	.if !Zero? ; overflow detected, value is above 9.22e18
		fclex ; clear the exception flags
		dec edi ; decrease precision by one digit
		fdiv f2s0dot1 ; divide by exactly 10.0 (slow but very rare)
		fistp f2sTmp64 ; store and pop it again
	.else
		fstp st ; pop the original to rebalance the FPU stack
	.endif
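
For readers who do not speak x87, here is a rough Python model of the same fallback. It is a sketch under assumptions, not the Str$() code: to_qword is a hypothetical helper, the input is taken as an integer already scaled to 19 digits, and a plain comparison stands in for the invalid-operation flag that the assembly reads with fnstsw ax. The real code rounds twice (fdiv, then fistp), which the model ignores.

```python
from fractions import Fraction

INT64_MAX = 9223372036854775807  # 2**63 - 1, the largest value a qword fistp can store

def to_qword(scaled, digits=19):
    """Return (integer, digits) for a value already scaled to `digits` digits.

    Hypothetical helper: the comparison stands in for testing bit 0
    (invalid operation) of the FPU status word.
    """
    if scaled > INT64_MAX:
        # Overflow path: drop one digit, as the fdiv / dec edi pair does.
        # round() on a Fraction rounds half to even, like the default FPU mode.
        return round(Fraction(scaled, 10)), digits - 1
    return scaled, digits

print(to_qword(1234567890123456789))  # fits: (1234567890123456789, 19)
print(to_qword(9500000000000000000))  # too big: (950000000000000000, 18)
print(f"{INT64_MAX / 10**19:.2%}")    # 92.23%
```

The last line shows one way to read the 92.23% figure: it is INT64_MAX relative to 10^19, i.e. the share of 19-digit-or-shorter integers that still fit into a signed QWORD.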

The MASM Forum
