FPU timings needed

Started by jj2007, December 29, 2014, 09:58:33 AM

jj2007

Hi, I am testing various FPU instructions and would like to get some timings, especially from AMD.

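A (reported as "both"):
      fldpi
      fistp stack
      ftst
      fnstsw ax

B (reported as "fstsw nowait"):
      fldpi
      fistp stack
      ; ftst
      fnstsw ax

C (reported as "fstsw wait"):
      fldpi
      fistp stack
      ; ftst
      fstsw ax

D (reported as "fxam+fstsw ax"):
      fxam
      fstsw ax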


Results:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
8143    cycles for 1000 * both
17701   cycles for 1000 * fstsw nowait
17722   cycles for 1000 * fstsw wait
4333    cycles for 1000 * fxam+fstsw ax

8144    cycles for 1000 * both
17722   cycles for 1000 * fstsw nowait
17740   cycles for 1000 * fstsw wait
4333    cycles for 1000 * fxam+fstsw ax

8144    cycles for 1000 * both
17697   cycles for 1000 * fstsw nowait
17718   cycles for 1000 * fstsw wait
4333    cycles for 1000 * fxam+fstsw ax

8143    cycles for 1000 * both
17743   cycles for 1000 * fstsw nowait
17754   cycles for 1000 * fstsw wait
4349    cycles for 1000 * fxam+fstsw ax

8145    cycles for 1000 * both
17700   cycles for 1000 * fstsw nowait
17726   cycles for 1000 * fstsw wait
4335    cycles for 1000 * fxam+fstsw ax

13      bytes for both
11      bytes for fstsw nowait
12      bytes for fstsw wait
8       bytes for fxam+fstsw ax

14368   = eax both
14368   = eax fstsw nowait
14368   = eax fstsw wait
15392   = eax fxam+fstsw ax

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2552    cycles for 1000 * both
2422    cycles for 1000 * fstsw nowait
2444    cycles for 1000 * fstsw wait
1960    cycles for 1000 * fxam+fstsw ax

2554    cycles for 1000 * both
2432    cycles for 1000 * fstsw nowait
2434    cycles for 1000 * fstsw wait
1969    cycles for 1000 * fxam+fstsw ax

2556    cycles for 1000 * both
2388    cycles for 1000 * fstsw nowait
2428    cycles for 1000 * fstsw wait
1964    cycles for 1000 * fxam+fstsw ax

2566    cycles for 1000 * both
2416    cycles for 1000 * fstsw nowait
2446    cycles for 1000 * fstsw wait
1953    cycles for 1000 * fxam+fstsw ax

8627    cycles for 1000 * both
8281    cycles for 1000 * fstsw nowait
9455    cycles for 1000 * fstsw wait
3785    cycles for 1000 * fxam+fstsw ax

13      bytes for both
11      bytes for fstsw nowait
12      bytes for fstsw wait
8       bytes for fxam+fstsw ax

14368   = eax both
14368   = eax fstsw nowait
14368   = eax fstsw wait
15392   = eax fxam+fstsw ax
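
The "= eax" values at the end of each run are raw x87 status words. As a rough aid for reading them, here is a minimal Python sketch (Python only because it is quick to check; it is not part of the test app, and the helper name decode_fpu_status is made up for illustration), assuming the standard x87 status word layout:

```python
# Exception flag bits of the x87 status word (bits 0..7)
FLAG_NAMES = ["IE", "DE", "ZE", "OE", "UE", "PE", "SF", "ES"]

def decode_fpu_status(sw):
    """Split a 16-bit x87 status word, as returned by fnstsw ax, into its fields."""
    return {
        "flags": [name for bit, name in enumerate(FLAG_NAMES) if (sw >> bit) & 1],
        "C0": (sw >> 8) & 1,
        "C1": (sw >> 9) & 1,
        "C2": (sw >> 10) & 1,
        "TOP": (sw >> 11) & 7,   # top-of-stack pointer, bits 11..13
        "C3": (sw >> 14) & 1,
        "B": (sw >> 15) & 1,     # busy bit
    }

for value in (14368, 15392):
    print(hex(value), decode_fpu_status(value))
```

Under this layout, 14368 (3820h) is TOP=7 with only PE set, which fits fistp having to round pi, and 15392 (3C20h) has C2 set in addition, which is what fxam reports for a normal finite number.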

jimg

AMD Phenom(tm) II X6 1045T Processor (SSE3)
+++++++13 of 20 tests valid, loop overhead is approx. 3584/1000 cycles

6264    cycles for 1000 * both
17819   cycles for 1000 * fstsw nowait
8535    cycles for 1000 * fstsw wait
8536    cycles for 1000 * fxam+fstsw ax

??      cycles for 1000 * both
17850   cycles for 1000 * fstsw nowait
8543    cycles for 1000 * fstsw wait
8506    cycles for 1000 * fxam+fstsw ax

??      cycles for 1000 * both
8294    cycles for 1000 * fstsw nowait
18049   cycles for 1000 * fstsw wait
8531    cycles for 1000 * fxam+fstsw ax

??      cycles for 1000 * both
8322    cycles for 1000 * fstsw nowait
8514    cycles for 1000 * fstsw wait
18049   cycles for 1000 * fxam+fstsw ax

??      cycles for 1000 * both
8317    cycles for 1000 * fstsw nowait
8527    cycles for 1000 * fstsw wait
8536    cycles for 1000 * fxam+fstsw ax

13      bytes for both
11      bytes for fstsw nowait
12      bytes for fstsw wait
8       bytes for fxam+fstsw ax

14368   = eax both
14368   = eax fstsw nowait
14368   = eax fstsw wait
15392   = eax fxam+fstsw ax

--- ok ---

sinsi

AMD A10-7850K APU with Radeon(TM) R7 Graphics   (SSE4)

3080    cycles for 1000 * both
21165   cycles for 1000 * fstsw nowait
21270   cycles for 1000 * fstsw wait
20876   cycles for 1000 * fxam+fstsw ax

3087    cycles for 1000 * both
20911   cycles for 1000 * fstsw nowait
20903   cycles for 1000 * fstsw wait
21190   cycles for 1000 * fxam+fstsw ax

3073    cycles for 1000 * both
21004   cycles for 1000 * fstsw nowait
21062   cycles for 1000 * fstsw wait
20859   cycles for 1000 * fxam+fstsw ax

3037    cycles for 1000 * both
20879   cycles for 1000 * fstsw nowait
21114   cycles for 1000 * fstsw wait
20816   cycles for 1000 * fxam+fstsw ax

3079    cycles for 1000 * both
20926   cycles for 1000 * fstsw nowait
20899   cycles for 1000 * fstsw wait
20865   cycles for 1000 * fxam+fstsw ax

13      bytes for both
11      bytes for fstsw nowait
12      bytes for fstsw wait
8       bytes for fxam+fstsw ax

14368   = eax both
14368   = eax fstsw nowait
14368   = eax fstsw wait
15392   = eax fxam+fstsw ax


MichaelW

This is on my el cheapo Asus laptop:

Intel(R) Celeron(R) CPU  N2830  @ 2.16GHz (SSE4)

18002   cycles for 1000 * both
18077   cycles for 1000 * fstsw nowait
18163   cycles for 1000 * fstsw wait
10861   cycles for 1000 * fxam+fstsw ax

18095   cycles for 1000 * both
18091   cycles for 1000 * fstsw nowait
18115   cycles for 1000 * fstsw wait
10892   cycles for 1000 * fxam+fstsw ax

18146   cycles for 1000 * both
18308   cycles for 1000 * fstsw nowait
18106   cycles for 1000 * fstsw wait
10925   cycles for 1000 * fxam+fstsw ax

18105   cycles for 1000 * both
18132   cycles for 1000 * fstsw nowait
18138   cycles for 1000 * fstsw wait
10903   cycles for 1000 * fxam+fstsw ax

18056   cycles for 1000 * both
18130   cycles for 1000 * fstsw nowait
18114   cycles for 1000 * fstsw wait
10925   cycles for 1000 * fxam+fstsw ax

13      bytes for both
11      bytes for fstsw nowait
12      bytes for fstsw wait
8       bytes for fxam+fstsw ax

14368   = eax both
14368   = eax fstsw nowait
14368   = eax fstsw wait
15392   = eax fxam+fstsw ax

--- ok ---

dedndave

prescott...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

8340    cycles for 1000 * both
8735    cycles for 1000 * fstsw nowait
8646    cycles for 1000 * fstsw wait
6103    cycles for 1000 * fxam+fstsw ax

8304    cycles for 1000 * both
8640    cycles for 1000 * fstsw nowait
8962    cycles for 1000 * fstsw wait
6161    cycles for 1000 * fxam+fstsw ax

8333    cycles for 1000 * both
8646    cycles for 1000 * fstsw nowait
8704    cycles for 1000 * fstsw wait
6105    cycles for 1000 * fxam+fstsw ax

8332    cycles for 1000 * both
8647    cycles for 1000 * fstsw nowait
8954    cycles for 1000 * fstsw wait
6098    cycles for 1000 * fxam+fstsw ax

8800    cycles for 1000 * both
8643    cycles for 1000 * fstsw nowait
8710    cycles for 1000 * fstsw wait
6105    cycles for 1000 * fxam+fstsw ax

13      bytes for both
11      bytes for fstsw nowait
12      bytes for fstsw wait
8       bytes for fxam+fstsw ax

14368   = eax both
14368   = eax fstsw nowait
14368   = eax fstsw wait
15392   = eax fxam+fstsw ax

jj2007

Thanks a lot!

For the curious, here is the application. When displaying a float with maximum precision, Str$() uses an intermediate QWORD integer, which gets converted with a variant of drizz' U64ToStr algo. That works fine for numbers up to 9223372036854775807 (2^63-1), which means you get 19 digits of precision for 92.23% of all numbers. But I had to catch the remaining 7.77% somehow, and that's where fnstsw ax has become helpful: it exposes the invalid operation flag that gets set when fistp is tried on a number above the magic value.

      fld st                ; create a copy for testing overflow
      fistp f2sTmp64        ; qword to mem - will choke for numbers higher than 922...
      fnstsw ax             ; copy the FPU status word to ax (no wait)
      test al, 1            ; bit 0 = IE, the invalid operation flag
      .if !Zero?            ; overflow detected, qword is above 9.22x
            fclex           ; clear the exception flags
            dec edi         ; decrease precision by one digit
            fdiv f2s0dot1   ; divide by exactly 10.0 (slow but very rare)
            fistp f2sTmp64  ; and pop it again
      .else
            fstp st         ; value already stored: pop the original to balance the FPU stack
      .endif
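
For anyone who wants to check the numbers without an assembler, the fallback can be modelled roughly in Python. This is only a sketch under assumptions: fistp_qword and scaled_to_qword are made-up names, exact fractions stand in for the 80-bit FPU register, and the FPU is assumed to be in its default round-to-nearest mode.

```python
from fractions import Fraction

INT64_MAX = 2**63 - 1   # 9223372036854775807, the "magic value"
INT64_MIN = -2**63

def fistp_qword(x):
    """Model of fistp to a qword: round to nearest (ties to even).
    Returns None where the FPU would flag an invalid operation (IE)."""
    n = round(Fraction(x))
    return n if INT64_MIN <= n <= INT64_MAX else None

def scaled_to_qword(x, digits=19):
    """Model of the snippet above: try the full precision first and,
    on overflow, give up one digit and divide by 10."""
    n = fistp_qword(x)
    if n is None:                        # the "test al, 1" branch
        digits -= 1                      # dec edi
        n = fistp_qword(Fraction(x) / 10)
    return n, digits

print(scaled_to_qword(9223372036854775807))   # fits, keeps 19 digits
print(scaled_to_qword(9999999999999999999))   # too big, falls back to 18 digits
print(INT64_MAX / 10**19)                     # share of the 19-digit range that fits
```

The last line gives roughly 0.9223, which is where the 92.23% comes from.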