Floating-point divides and square roots

Started by MichaelW, April 19, 2014, 02:26:49 PM

MichaelW

The attachment contains a test of floating-point divides and square roots, using FPU and SSE2 code, with optimizations as detailed in Agner Fog's optimizing_assembly.pdf, available here.

I added the long delay before the timing code because my test system is running MSE, and it (or possibly something else) writes to the disk ~5 seconds after the app is launched, which would otherwise disturb the cycle counts.
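For readers without the attachment, here is a minimal sketch of what "rcpss+mulss extended to 23-bit precision" most likely refers to: one Newton-Raphson step, x1 = 2*x0 - d*x0*x0, applied to a roughly 12-bit reciprocal estimate, which is the refinement Agner Fog's manual describes. It is written in Python rather than MASM so the arithmetic can be checked anywhere. The helpers `f32`, `crude_reciprocal` and `refine` are hypothetical names, and `crude_reciprocal` is only a stand-in for rcpss (it truncates 1/d to 12 bits and handles positive d only); the real instruction returns a hardware-specific estimate.

```python
import math
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def crude_reciprocal(d, bits=12):
    """Hypothetical stand-in for rcpss: 1/d truncated to `bits` significant bits.

    Assumes d > 0. The real instruction is only guaranteed to be close to 1/d.
    """
    m, e = math.frexp(1.0 / d)          # 1/d = m * 2**e with m in [0.5, 1)
    scale = 1 << bits
    return f32(math.ldexp(math.floor(m * scale) / scale, e))


def refine(d, x0):
    """One Newton-Raphson step, x1 = 2*x0 - d*x0*x0, rounded after each operation."""
    t = f32(x0 * x0)                    # mulss
    t = f32(d * t)                      # mulss
    return f32(f32(x0 + x0) - t)        # addss, subss


d = 3.0
x0 = crude_reciprocal(d)
x1 = refine(d, x0)
print("estimate error:", abs(x0 * d - 1.0))
print("refined error: ", abs(x1 * d - 1.0))
```

Each step roughly doubles the number of correct bits, so a single step is enough to go from about 12 bits to nearly full single precision. The extra multiplies and subtract are the likely reason the refined variant costs so much more than the bare rcpss+mulss pair in the timings.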

Typical results on my P4 Northwood system:

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         330 cycles
fdiv PC = 53-bit                         294 cycles
fdiv PC = 24-bit                         172 cycles
divsd                                    294 cycles
rcpss+mulss (12-bit precision)           68 cycles
rcpss+mulss extended to 23-bit precision 194 cycles

fsqrt PC = 64-bit                        330 cycles
fsqrt PC = 53-bit                        290 cycles
fsqrt PC = 24-bit                        170 cycles
sqrtsd                                   295 cycles


It looks like the 53-bit precision fdiv and fsqrt might be running on the same execution unit as divsd and sqrtsd.
Well Microsoft, here's another nice mess you've gotten us into.
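A quick way to see where the "PC = 24-bit" values come from: they agree with the 53-bit results rounded to single precision. The sketch below is in Python, not taken from the attachment, and `to_single` is a hypothetical helper name. It assumes the test prints every result as a double. That assumption would also explain why the 64-bit and 53-bit lines print identically.

```python
import struct


def to_single(x):
    """Round a double to the nearest single-precision (24-bit significand) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


quotient_53 = 4.400243013365736   # posted "fdiv PC = 53-bit" / divsd value
root_53 = 2.330686594117708       # posted "fsqrt PC = 53-bit" / sqrtsd value

# Print with the same number of decimals as the posted results.
print(f"fdiv  rounded to 24 bits: {to_single(quotient_53):.15f}")
print(f"fsqrt rounded to 24 bits: {to_single(root_53):.15f}")
```

Rounding an already rounded 53-bit result is not strictly the same operation as computing directly at 24-bit precision, but for these two values the outcome matches the posted 24-bit lines.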

jj2007

Celeron M results. And indeed, I have yet to see a CPU that behaves as claimed by the fans of SIMD instructions, i.e. "faster and better than the old obsolete FPU".

I like SSE2+ for other purposes, but it's very difficult to beat the FPU ;-)

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         290 cycles
fdiv PC = 53-bit                         242 cycles
fdiv PC = 24-bit                         130 cycles
divsd                                    243 cycles
rcpss+mulss (12-bit precision)           50 cycles
rcpss+mulss extended to 23-bit precision 139 cycles

fsqrt PC = 64-bit                        539 cycles
fsqrt PC = 53-bit                        450 cycles
fsqrt PC = 24-bit                        218 cycles
sqrtsd                                   451 cycles


@Siekmanski: Sure?
fdiv PC = 53-bit                         106 cycles
divsd                                    106 cycles

fsqrt PC = 53-bit                        107 cycles
sqrtsd                                   153 cycles


Siekmanski

Results on my system,

OS        : Windows 8.1 (v6.2.9200)
Processor : (12x) Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         196 cycles
fdiv PC = 53-bit                         106 cycles
fdiv PC = 24-bit                         50 cycles
divsd                                    106 cycles
rcpss+mulss (12-bit precision)           73 cycles
rcpss+mulss extended to 23-bit precision 178 cycles

fsqrt PC = 64-bit                        130 cycles
fsqrt PC = 53-bit                        107 cycles
fsqrt PC = 24-bit                        50 cycles
sqrtsd                                   153 cycles


Quote: @Siekmanski: Sure?

Yes, sqrtsd seems to be slow on my system.


fdiv PC = 64-bit                         142 cycles (varies between 142 and 202 cycles from run to run)
fdiv PC = 53-bit                         106 cycles
fdiv PC = 24-bit                         50 cycles
divsd                                    106 cycles
rcpss+mulss (12-bit precision)           74 cycles
rcpss+mulss extended to 23-bit precision 178 cycles

fsqrt PC = 64-bit                        130 cycles
fsqrt PC = 53-bit                        106 cycles
fsqrt PC = 24-bit                        50 cycles
sqrtsd                                   155 cycles
Creative coders use backward thinking techniques as a strategy.

sinsi

Win 8.1 x64, i7 3770K

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         128 cycles
fdiv PC = 53-bit                         98 cycles
fdiv PC = 24-bit                         47 cycles
divsd                                    99 cycles
rcpss+mulss (12-bit precision)           69 cycles
rcpss+mulss extended to 23-bit precision 164 cycles

fsqrt PC = 64-bit                        121 cycles
fsqrt PC = 53-bit                        98 cycles
fsqrt PC = 24-bit                        48 cycles
sqrtsd                                   143 cycles


jj2007

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
fdiv PC = 64-bit                         266 cycles
fdiv PC = 53-bit                         169 cycles
fdiv PC = 24-bit                         105 cycles
divsd                                    169 cycles
rcpss+mulss (12-bit precision)           59 cycles
rcpss+mulss extended to 23-bit precision 143 cycles

fsqrt PC = 64-bit                        152 cycles
fsqrt PC = 53-bit                        131 cycles
fsqrt PC = 24-bit                        85 cycles
sqrtsd                                   161 cycles


It seems the whole iFamily is affected ;-)

Gunther

Michael,

the results:

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         194 cycles
fdiv PC = 53-bit                         93 cycles
fdiv PC = 24-bit                         43 cycles
divsd                                    93 cycles
rcpss+mulss (12-bit precision)           65 cycles
rcpss+mulss extended to 23-bit precision 156 cycles

fsqrt PC = 64-bit                        114 cycles
fsqrt PC = 53-bit                        93 cycles
fsqrt PC = 24-bit                        43 cycles
sqrtsd                                   135 cycles

Press any key to continue ...


Gunther
You have to know the facts before you can distort them.

FORTRANS

Hi Michael,

Press any key to continue ...
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         293 cycles
fdiv PC = 53-bit                         244 cycles
fdiv PC = 24-bit                         131 cycles
divsd                                    246 cycles
rcpss+mulss (12-bit precision)           51 cycles
rcpss+mulss extended to 23-bit precision 140 cycles

fsqrt PC = 64-bit                        545 cycles
fsqrt PC = 53-bit                        455 cycles
fsqrt PC = 24-bit                        220 cycles
sqrtsd                                   456 cycles


Processor   x86 Family 6 Model 13 Stepping 6 GenuineIntel ~1694 Mhz
GenuineIntel
Intel(R) Pentium(R) M processor 1.70GHz


Steve N.

dedndave

P4 prescott w/htt, XP SP3
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         361 cycles
fdiv PC = 53-bit                         319 cycles
fdiv PC = 24-bit                         255 cycles
divsd                                    319 cycles
rcpss+mulss (12-bit precision)           84 cycles
rcpss+mulss extended to 23-bit precision 242 cycles

fsqrt PC = 64-bit                        359 cycles
fsqrt PC = 53-bit                        319 cycles
fsqrt PC = 24-bit                        255 cycles
sqrtsd                                   319 cycles

sinsi

AMD A10-7850K APU @ 3.70GHz, Windows 7 Ultimate x64

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.400319099426270
rcpss+mulss extended to 23-bit precision 4.400242328643799

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         117 cycles
fdiv PC = 53-bit                         99 cycles
fdiv PC = 24-bit                         41 cycles
divsd                                    98 cycles
rcpss+mulss (12-bit precision)           9 cycles
rcpss+mulss extended to 23-bit precision 146 cycles

fsqrt PC = 64-bit                        162 cycles
fsqrt PC = 53-bit                        140 cycles
fsqrt PC = 24-bit                        64 cycles
sqrtsd                                   319 cycles


MichaelW

For the AMD, the rcpss+mulss result differs from the one returned by the Intel processors tested here, and the instructions are much faster.
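To put a number on that difference, the sketch below converts the posted values into "accurate bits", using the divsd result as the reference. It is Python, the helper name `accurate_bits` is hypothetical, and the values are copied from the posts above. Intel documents rcpss with a relative error of at most 1.5 * 2^-12, so different vendors are free to return different estimates within that bound.

```python
import math

REFERENCE = 4.400243013365736     # posted divsd / fdiv 53-bit quotient

posted = {
    "Intel rcpss+mulss":          4.399655818939209,
    "AMD   rcpss+mulss":          4.400319099426270,
    "Intel rcpss+mulss extended": 4.400242805480957,
    "AMD   rcpss+mulss extended": 4.400242328643799,
}


def accurate_bits(approx, reference=REFERENCE):
    """Return -log2 of the relative error, i.e. roughly how many bits are correct."""
    return -math.log2(abs(approx - reference) / abs(reference))


for name, value in posted.items():
    print(f"{name:28s} {accurate_bits(value):5.1f} bits")
```

By this measure the AMD estimate starts out a few bits better than the Intel one, yet its refined result ends up one single-precision unit below the Intel value, so the two parts evidently round or approximate differently along the way.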

I was thinking that the slowness of sqrtsd might be a side effect of it sharing circuitry with sqrtpd, or something similar, but when I add a test:

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL
        REPEAT 8
            sqrtpd      xmm1, d8
        ENDM
    counter_end
    printf("sqrtpd                                   %d cycles\n\n",eax)


I get 545 cycles versus 294 cycles for sqrtsd.
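Broken down per result, the two numbers are closer than they look. The sketch below is Python with a hypothetical helper name. It assumes that both counts cover a block of 8 instructions, as in the REPEAT 8 above, and that the sqrtsd test is built the same way, which the posted snippet does not show.

```python
def cycles_per_result(block_cycles, instructions, results_per_instruction):
    """Average cycles spent per square root produced by the timed block."""
    return block_cycles / (instructions * results_per_instruction)


# sqrtpd produces two double-precision roots per instruction, sqrtsd one.
sqrtpd = cycles_per_result(545, instructions=8, results_per_instruction=2)
sqrtsd = cycles_per_result(294, instructions=8, results_per_instruction=1)

print(f"sqrtpd: {sqrtpd:.1f} cycles per root")
print(f"sqrtsd: {sqrtsd:.1f} cycles per root")
```

Under those assumptions a packed square root on this P4 costs nearly as much as two scalar ones, so sqrtpd gains little per root.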

Gunther

Michael,

I've an old AMD box elsewhere. Should I test your application with that?

Gunther

MichaelW

Quote: Should I test your application with that?

Yes, if it is not too much trouble. IIRC the Athlon had a faster FPU than the P4, its Intel counterpart at the time, even though it did not support SSE2.

Gunther

Michael,

Quote from: MichaelW on April 20, 2014, 09:58:29 PM
Yes, if it is not too much trouble. While it did not support SSE2, IIRC the Athlon had a faster FPU than the Intel part, a P4 at the time.

no big deal. Here are the results with the AMD Athlon X2, Dual-Core TK-57, 1.9 GHz:

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

c:\tasm\work>test
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.400319099426270
rcpss+mulss extended to 23-bit precision 4.400242328643799

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         169 cycles
fdiv PC = 53-bit                         136 cycles
fdiv PC = 24-bit                         104 cycles
divsd                                    132 cycles
rcpss+mulss (12-bit precision)           49 cycles
rcpss+mulss extended to 23-bit precision 146 cycles

fsqrt PC = 64-bit                        256 cycles
fsqrt PC = 53-bit                        192 cycles
fsqrt PC = 24-bit                        128 cycles
sqrtsd                                   185 cycles

Press any key to continue ...


Gunther

hutch--

This is the result on my old 3 gig Core 2 quad.


fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         174 cycles
fdiv PC = 53-bit                         158 cycles
fdiv PC = 24-bit                         90 cycles
divsd                                    157 cycles
rcpss+mulss (12-bit precision)           50 cycles
rcpss+mulss extended to 23-bit precision 138 cycles

fsqrt PC = 64-bit                        171 cycles
fsqrt PC = 53-bit                        147 cycles
fsqrt PC = 24-bit                        90 cycles
sqrtsd                                   154 cycles

Press any key to continue ...