Hi, I am testing various FPU instructions and would like to get some timings, especially from AMD.
A ("both" in the results):
fldpi
fistp stack
ftst
fnstsw ax
B ("fstsw nowait"):
fldpi
fistp stack
; ftst
fnstsw ax
C ("fstsw wait"):
fldpi
fistp stack
; ftst
fstsw ax
D ("fxam+fstsw ax"):
fxam
fstsw ax
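All four snippets finish by copying the FPU status word into ax, and the results report that value as "eax" (14368 or 15392). As a rough aid for reading those numbers, here is a small Python sketch. It is not part of the test bed, and the name decode_status_word is made up for illustration. It splits a status word into its fields using the standard x87 bit layout:

```python
# Exception flags occupy bits 0..7 of the x87 status word.
EXCEPTION_FLAGS = ["IE", "DE", "ZE", "OE", "UE", "PE", "SF", "ES"]

def decode_status_word(sw):
    """Split a 16-bit x87 status word (as left in ax by fstsw/fnstsw) into fields."""
    return {
        "flags": [name for bit, name in enumerate(EXCEPTION_FLAGS) if (sw >> bit) & 1],
        "C0": (sw >> 8) & 1,     # condition code bits, set by ftst/fxam/fcom
        "C1": (sw >> 9) & 1,
        "C2": (sw >> 10) & 1,
        "TOP": (sw >> 11) & 7,   # top-of-stack pointer
        "C3": (sw >> 14) & 1,
        "busy": (sw >> 15) & 1,
    }

for value in (14368, 15392):     # the two eax values from the results
    print(value, hex(value), decode_status_word(value))
```

Both values decode to TOP = 7 with only the PE (precision) flag set, which would be consistent with fistp rounding pi to 3. The IE bit (bit 0) is clear in both. 15392 additionally has C2 set with C3 and C0 clear, which is how fxam reports a normal finite number.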
Results:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
8143 cycles for 1000 * both
17701 cycles for 1000 * fstsw nowait
17722 cycles for 1000 * fstsw wait
4333 cycles for 1000 * fxam+fstsw ax
8144 cycles for 1000 * both
17722 cycles for 1000 * fstsw nowait
17740 cycles for 1000 * fstsw wait
4333 cycles for 1000 * fxam+fstsw ax
8144 cycles for 1000 * both
17697 cycles for 1000 * fstsw nowait
17718 cycles for 1000 * fstsw wait
4333 cycles for 1000 * fxam+fstsw ax
8143 cycles for 1000 * both
17743 cycles for 1000 * fstsw nowait
17754 cycles for 1000 * fstsw wait
4349 cycles for 1000 * fxam+fstsw ax
8145 cycles for 1000 * both
17700 cycles for 1000 * fstsw nowait
17726 cycles for 1000 * fstsw wait
4335 cycles for 1000 * fxam+fstsw ax
13 bytes for both
11 bytes for fstsw nowait
12 bytes for fstsw wait
8 bytes for fxam+fstsw ax
14368 = eax both
14368 = eax fstsw nowait
14368 = eax fstsw wait
15392 = eax fxam+fstsw ax
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2552 cycles for 1000 * both
2422 cycles for 1000 * fstsw nowait
2444 cycles for 1000 * fstsw wait
1960 cycles for 1000 * fxam+fstsw ax
2554 cycles for 1000 * both
2432 cycles for 1000 * fstsw nowait
2434 cycles for 1000 * fstsw wait
1969 cycles for 1000 * fxam+fstsw ax
2556 cycles for 1000 * both
2388 cycles for 1000 * fstsw nowait
2428 cycles for 1000 * fstsw wait
1964 cycles for 1000 * fxam+fstsw ax
2566 cycles for 1000 * both
2416 cycles for 1000 * fstsw nowait
2446 cycles for 1000 * fstsw wait
1953 cycles for 1000 * fxam+fstsw ax
8627 cycles for 1000 * both
8281 cycles for 1000 * fstsw nowait
9455 cycles for 1000 * fstsw wait
3785 cycles for 1000 * fxam+fstsw ax
13 bytes for both
11 bytes for fstsw nowait
12 bytes for fstsw wait
8 bytes for fxam+fstsw ax
14368 = eax both
14368 = eax fstsw nowait
14368 = eax fstsw wait
15392 = eax fxam+fstsw ax
AMD Phenom(tm) II X6 1045T Processor (SSE3)
+++++++13 of 20 tests valid, loop overhead is approx. 3584/1000 cycles
6264 cycles for 1000 * both
17819 cycles for 1000 * fstsw nowait
8535 cycles for 1000 * fstsw wait
8536 cycles for 1000 * fxam+fstsw ax
?? cycles for 1000 * both
17850 cycles for 1000 * fstsw nowait
8543 cycles for 1000 * fstsw wait
8506 cycles for 1000 * fxam+fstsw ax
?? cycles for 1000 * both
8294 cycles for 1000 * fstsw nowait
18049 cycles for 1000 * fstsw wait
8531 cycles for 1000 * fxam+fstsw ax
?? cycles for 1000 * both
8322 cycles for 1000 * fstsw nowait
8514 cycles for 1000 * fstsw wait
18049 cycles for 1000 * fxam+fstsw ax
?? cycles for 1000 * both
8317 cycles for 1000 * fstsw nowait
8527 cycles for 1000 * fstsw wait
8536 cycles for 1000 * fxam+fstsw ax
13 bytes for both
11 bytes for fstsw nowait
12 bytes for fstsw wait
8 bytes for fxam+fstsw ax
14368 = eax both
14368 = eax fstsw nowait
14368 = eax fstsw wait
15392 = eax fxam+fstsw ax
--- ok ---
AMD A10-7850K APU with Radeon(TM) R7 Graphics (SSE4)
3080 cycles for 1000 * both
21165 cycles for 1000 * fstsw nowait
21270 cycles for 1000 * fstsw wait
20876 cycles for 1000 * fxam+fstsw ax
3087 cycles for 1000 * both
20911 cycles for 1000 * fstsw nowait
20903 cycles for 1000 * fstsw wait
21190 cycles for 1000 * fxam+fstsw ax
3073 cycles for 1000 * both
21004 cycles for 1000 * fstsw nowait
21062 cycles for 1000 * fstsw wait
20859 cycles for 1000 * fxam+fstsw ax
3037 cycles for 1000 * both
20879 cycles for 1000 * fstsw nowait
21114 cycles for 1000 * fstsw wait
20816 cycles for 1000 * fxam+fstsw ax
3079 cycles for 1000 * both
20926 cycles for 1000 * fstsw nowait
20899 cycles for 1000 * fstsw wait
20865 cycles for 1000 * fxam+fstsw ax
13 bytes for both
11 bytes for fstsw nowait
12 bytes for fstsw wait
8 bytes for fxam+fstsw ax
14368 = eax both
14368 = eax fstsw nowait
14368 = eax fstsw wait
15392 = eax fxam+fstsw ax
Thanks a lot, jimg and sinsi
(I needed this for a small improvement to binary-to-ASCII conversion: http://masm32.com/board/index.php?topic=94.msg41105#msg41105)
This is on my el cheapo Asus laptop:
Intel(R) Celeron(R) CPU N2830 @ 2.16GHz (SSE4)
18002 cycles for 1000 * both
18077 cycles for 1000 * fstsw nowait
18163 cycles for 1000 * fstsw wait
10861 cycles for 1000 * fxam+fstsw ax
18095 cycles for 1000 * both
18091 cycles for 1000 * fstsw nowait
18115 cycles for 1000 * fstsw wait
10892 cycles for 1000 * fxam+fstsw ax
18146 cycles for 1000 * both
18308 cycles for 1000 * fstsw nowait
18106 cycles for 1000 * fstsw wait
10925 cycles for 1000 * fxam+fstsw ax
18105 cycles for 1000 * both
18132 cycles for 1000 * fstsw nowait
18138 cycles for 1000 * fstsw wait
10903 cycles for 1000 * fxam+fstsw ax
18056 cycles for 1000 * both
18130 cycles for 1000 * fstsw nowait
18114 cycles for 1000 * fstsw wait
10925 cycles for 1000 * fxam+fstsw ax
13 bytes for both
11 bytes for fstsw nowait
12 bytes for fstsw wait
8 bytes for fxam+fstsw ax
14368 = eax both
14368 = eax fstsw nowait
14368 = eax fstsw wait
15392 = eax fxam+fstsw ax
--- ok ---
prescott...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
8340 cycles for 1000 * both
8735 cycles for 1000 * fstsw nowait
8646 cycles for 1000 * fstsw wait
6103 cycles for 1000 * fxam+fstsw ax
8304 cycles for 1000 * both
8640 cycles for 1000 * fstsw nowait
8962 cycles for 1000 * fstsw wait
6161 cycles for 1000 * fxam+fstsw ax
8333 cycles for 1000 * both
8646 cycles for 1000 * fstsw nowait
8704 cycles for 1000 * fstsw wait
6105 cycles for 1000 * fxam+fstsw ax
8332 cycles for 1000 * both
8647 cycles for 1000 * fstsw nowait
8954 cycles for 1000 * fstsw wait
6098 cycles for 1000 * fxam+fstsw ax
8800 cycles for 1000 * both
8643 cycles for 1000 * fstsw nowait
8710 cycles for 1000 * fstsw wait
6105 cycles for 1000 * fxam+fstsw ax
13 bytes for both
11 bytes for fstsw nowait
12 bytes for fstsw wait
8 bytes for fxam+fstsw ax
14368 = eax both
14368 = eax fstsw nowait
14368 = eax fstsw wait
15392 = eax fxam+fstsw ax
Thanks a lot
For the curious, here is the application. When displaying a float with maximum precision, Str$() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1186) uses an intermediate QWORD integer, which gets converted with a variant of drizz' U64ToStr algo (http://www.masmforum.com/board/index.php?topic=9857.msg72422#msg72422). That works fine for values up to 9223372036854775807 (2^63-1), which means you get 19 digits of precision for 92.23% of all numbers. But I had to catch the remaining 7.77% somehow, and that's where fnstsw ax has become helpful: it picks up the invalid-operation flag that fistp sets when asked to store a number above the magic value.
fld st ; create a copy for testing overflow
fistp f2sTmp64 ; qword to mem - will choke for numbers higher than 922...
fnstsw ax ; copy the FPU status word to ax
test al, 1 ; bit 0 = IE (invalid operation)
.if !Zero? ; overflow detected, qword is above 9.22x
  fclex ; clear the exception flags
  dec edi ; decrease precision by one digit
  fdiv f2s0dot1 ; divide by exactly 10.0 (slow but very rare)
  fistp f2sTmp64 ; store again, popping the original
.else
  fstp st ; pop the original to rebalance the FPU stack
.endif
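To make the range check and the fallback easier to follow outside the FPU, here is a minimal Python model. It is only a sketch under assumptions: `mantissa` stands for the float after it has been scaled to a 19-digit integer, `to_qword` is a hypothetical name, and the comparison against 2^63-1 stands in for the IE bit that fistp sets. The real code works on the FPU stack at extended precision, so its rounding may differ in the last digit.

```python
from fractions import Fraction

INT64_MAX = 2**63 - 1   # 9223372036854775807, the largest qword fistp can store

def to_qword(mantissa, digits=19):
    """Return (integer, digits) roughly the way the .if/.else block would.

    'mantissa' models the value in st(0), already scaled to 'digits' decimal digits.
    """
    overflow = mantissa > INT64_MAX        # FPU: IE bit (bit 0 of ax) set after fistp
    if overflow:
        digits -= 1                        # dec edi
        # fdiv by 10.0, then fistp; round() on a Fraction rounds half to even,
        # like the FPU's default rounding mode
        mantissa = round(Fraction(mantissa, 10))
    return mantissa, digits

print(to_qword(1234567890123456789))   # fits: unchanged, 19 digits
print(to_qword(9223372036854775808))   # one above the limit: 18 digits

# Where the 92.23% seems to come from: the share of 0..10^19 that fits a signed qword
share = INT64_MAX / 10**19
print(f"{share:.2%} fit, {1 - share:.2%} need the fallback")
```

The last line prints 92.23% and 7.77%, which matches the split quoted above if "all numbers" is read as all 19-digit values up to 10^19.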