The MASM Forum

General => The Laboratory => Topic started by: MichaelW on April 19, 2014, 02:26:49 PM

Title: Floating-point divides and square roots
Post by: MichaelW on April 19, 2014, 02:26:49 PM
The attachment contains a test of floating-point divides and square roots, using FPU and SSE2 code, with optimizations as detailed in Agner Fog's optimizing_assembly.pdf, available here (http://www.agner.org/optimize/).

I added the long delay before the timing code because my test system is running MSE (Microsoft Security Essentials), and it (or possibly something else) writes to the disk about 5 seconds after the app is launched, which would otherwise disturb the cycle counts.

Typical results on my P4 Northwood system:

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         330 cycles
fdiv PC = 53-bit                         294 cycles
fdiv PC = 24-bit                         172 cycles
divsd                                    294 cycles
rcpss+mulss (12-bit precision)           68 cycles
rcpss+mulss extended to 23-bit precision 194 cycles

fsqrt PC = 64-bit                        330 cycles
fsqrt PC = 53-bit                        290 cycles
fsqrt PC = 24-bit                        170 cycles
sqrtsd                                   295 cycles


It looks like the 53-bit precision fdiv and fsqrt might be running on the same execution unit as divsd and sqrtsd.
Title: Re: Floating-point divides and square roots
Post by: jj2007 on April 19, 2014, 05:02:08 PM
Celeron M results. And indeed, I have yet to see a CPU that behaves as claimed by the fans of SIMD instructions, i.e. "faster and better than the old obsolete FPU".

I like SSE2+ for other purposes, but it's very difficult to beat the FPU ;-)

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         290 cycles
fdiv PC = 53-bit                         242 cycles
fdiv PC = 24-bit                         130 cycles
divsd                                    243 cycles
rcpss+mulss (12-bit precision)           50 cycles
rcpss+mulss extended to 23-bit precision 139 cycles

fsqrt PC = 64-bit                        539 cycles
fsqrt PC = 53-bit                        450 cycles
fsqrt PC = 24-bit                        218 cycles
sqrtsd                                   451 cycles


@Siekmanski: Sure?
fdiv PC = 53-bit                         106 cycles
divsd                                    106 cycles

fsqrt PC = 53-bit                        107 cycles
sqrtsd                                   153 cycles

Title: Re: Floating-point divides and square roots
Post by: Siekmanski on April 19, 2014, 05:03:52 PM
Results on my system,

OS        : Windows 8.1 (v6.2.9200)
Processor : (12x) Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         196 cycles
fdiv PC = 53-bit                         106 cycles
fdiv PC = 24-bit                         50 cycles
divsd                                    106 cycles
rcpss+mulss (12-bit precision)           73 cycles
rcpss+mulss extended to 23-bit precision 178 cycles

fsqrt PC = 64-bit                        130 cycles
fsqrt PC = 53-bit                        107 cycles
fsqrt PC = 24-bit                        50 cycles
sqrtsd                                   153 cycles


Quote: @Siekmanski: Sure?


Yes, sqrtsd seems to be slow on my system.


fdiv PC = 64-bit                         142 cycles (varies between 142 and 202 cycles from run to run)
fdiv PC = 53-bit                         106 cycles
fdiv PC = 24-bit                         50 cycles
divsd                                    106 cycles
rcpss+mulss (12-bit precision)           74 cycles
rcpss+mulss extended to 23-bit precision 178 cycles

fsqrt PC = 64-bit                        130 cycles
fsqrt PC = 53-bit                        106 cycles
fsqrt PC = 24-bit                        50 cycles
sqrtsd                                   155 cycles
Title: Re: Floating-point divides and square roots
Post by: sinsi on April 19, 2014, 05:16:10 PM
Win 8.1 x64, i7 3770K

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         128 cycles
fdiv PC = 53-bit                         98 cycles
fdiv PC = 24-bit                         47 cycles
divsd                                    99 cycles
rcpss+mulss (12-bit precision)           69 cycles
rcpss+mulss extended to 23-bit precision 164 cycles

fsqrt PC = 64-bit                        121 cycles
fsqrt PC = 53-bit                        98 cycles
fsqrt PC = 24-bit                        48 cycles
sqrtsd                                   143 cycles

Title: Re: Floating-point divides and square roots
Post by: jj2007 on April 19, 2014, 08:16:28 PM
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
fdiv PC = 64-bit                         266 cycles
fdiv PC = 53-bit                         169 cycles
fdiv PC = 24-bit                         105 cycles
divsd                                    169 cycles
rcpss+mulss (12-bit precision)           59 cycles
rcpss+mulss extended to 23-bit precision 143 cycles

fsqrt PC = 64-bit                        152 cycles
fsqrt PC = 53-bit                        131 cycles
fsqrt PC = 24-bit                        85 cycles
sqrtsd                                   161 cycles


It seems the whole iFamily is affected ;-)
Title: Re: Floating-point divides and square roots
Post by: Gunther on April 19, 2014, 09:14:17 PM
Michael,

the results:

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         194 cycles
fdiv PC = 53-bit                         93 cycles
fdiv PC = 24-bit                         43 cycles
divsd                                    93 cycles
rcpss+mulss (12-bit precision)           65 cycles
rcpss+mulss extended to 23-bit precision 156 cycles

fsqrt PC = 64-bit                        114 cycles
fsqrt PC = 53-bit                        93 cycles
fsqrt PC = 24-bit                        43 cycles
sqrtsd                                   135 cycles



Gunther
Title: Re: Floating-point divides and square roots
Post by: FORTRANS on April 20, 2014, 12:28:14 AM
Hi Michael,

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         293 cycles
fdiv PC = 53-bit                         244 cycles
fdiv PC = 24-bit                         131 cycles
divsd                                    246 cycles
rcpss+mulss (12-bit precision)           51 cycles
rcpss+mulss extended to 23-bit precision 140 cycles

fsqrt PC = 64-bit                        545 cycles
fsqrt PC = 53-bit                        455 cycles
fsqrt PC = 24-bit                        220 cycles
sqrtsd                                   456 cycles


Processor: x86 Family 6 Model 13 Stepping 6 GenuineIntel ~1694 MHz
Intel(R) Pentium(R) M processor 1.70GHz


Steve N.
Title: Re: Floating-point divides and square roots
Post by: dedndave on April 20, 2014, 03:32:30 AM
P4 Prescott w/HTT, XP SP3
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         361 cycles
fdiv PC = 53-bit                         319 cycles
fdiv PC = 24-bit                         255 cycles
divsd                                    319 cycles
rcpss+mulss (12-bit precision)           84 cycles
rcpss+mulss extended to 23-bit precision 242 cycles

fsqrt PC = 64-bit                        359 cycles
fsqrt PC = 53-bit                        319 cycles
fsqrt PC = 24-bit                        255 cycles
sqrtsd                                   319 cycles
Title: Re: Floating-point divides and square roots
Post by: sinsi on April 20, 2014, 02:13:56 PM
AMD A10-7850K APU @ 3.70GHz, Windows 7 Ultimate x64

fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.400319099426270
rcpss+mulss extended to 23-bit precision 4.400242328643799

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         117 cycles
fdiv PC = 53-bit                         99 cycles
fdiv PC = 24-bit                         41 cycles
divsd                                    98 cycles
rcpss+mulss (12-bit precision)           9 cycles
rcpss+mulss extended to 23-bit precision 146 cycles

fsqrt PC = 64-bit                        162 cycles
fsqrt PC = 53-bit                        140 cycles
fsqrt PC = 24-bit                        64 cycles
sqrtsd                                   319 cycles

Title: Re: Floating-point divides and square roots
Post by: MichaelW on April 20, 2014, 03:10:42 PM
On the AMD, the rcpss+mulss result differs from the one produced by the Intel processors tested here (4.400319099426270 versus 4.399655818939209 at 12-bit precision), and the instruction pair is much faster (9 cycles).

I had thought that the slowness of sqrtsd might be a side effect of its sharing circuitry with sqrtpd, or something similar, but then I added a test:

    ; Time 8 consecutive sqrtpd instructions; the cycle count is returned in eax.
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL
        REPEAT 8
            sqrtpd      xmm1, d8
        ENDM
    counter_end
    printf("sqrtpd                                   %d cycles\n\n",eax)


I get 545 cycles for sqrtpd versus 294 cycles for sqrtsd.
Title: Re: Floating-point divides and square roots
Post by: Gunther on April 20, 2014, 09:02:36 PM
Michael,

I've an old AMD box elsewhere. Should I test your application with that?

Gunther
Title: Re: Floating-point divides and square roots
Post by: MichaelW on April 20, 2014, 09:58:29 PM
Quote: Should I test your application with that?

Yes, if it is not too much trouble. While it did not support SSE2, IIRC the Athlon had a faster FPU than the Intel part, a P4 at the time.
Title: Re: Floating-point divides and square roots
Post by: Gunther on April 21, 2014, 12:14:39 AM
Michael,

Quote from: MichaelW on April 20, 2014, 09:58:29 PM
Yes, if it is not too much trouble. While it did not support SSE2, IIRC the Athlon had a faster FPU than the Intel part, a P4 at the time.

no big deal. Here are the results with the AMD Athlon X2, Dual-Core TK-57, 1.9 GHz:

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

c:\tasm\work>test
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.400319099426270
rcpss+mulss extended to 23-bit precision 4.400242328643799

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         169 cycles
fdiv PC = 53-bit                         136 cycles
fdiv PC = 24-bit                         104 cycles
divsd                                    132 cycles
rcpss+mulss (12-bit precision)           49 cycles
rcpss+mulss extended to 23-bit precision 146 cycles

fsqrt PC = 64-bit                        256 cycles
fsqrt PC = 53-bit                        192 cycles
fsqrt PC = 24-bit                        128 cycles
sqrtsd                                   185 cycles



Gunther
Title: Re: Floating-point divides and square roots
Post by: hutch-- on April 21, 2014, 01:58:15 PM
This is the result on my old 3 GHz Core 2 Quad.


fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         174 cycles
fdiv PC = 53-bit                         158 cycles
fdiv PC = 24-bit                         90 cycles
divsd                                    157 cycles
rcpss+mulss (12-bit precision)           50 cycles
rcpss+mulss extended to 23-bit precision 138 cycles

fsqrt PC = 64-bit                        171 cycles
fsqrt PC = 53-bit                        147 cycles
fsqrt PC = 24-bit                        90 cycles
sqrtsd                                   154 cycles
