Author Topic: Floating-point divides and square roots  (Read 8024 times)

MichaelW

  • Global Moderator
  • Member
  • *****
  • Posts: 1209
Floating-point divides and square roots
« on: April 19, 2014, 02:26:49 PM »
The attachment contains a test of floating-point divides and square roots, using FPU and SSE2 code, with optimizations as detailed in Agner Fog’s optimizing_assembly.pdf, available here.

I added the long delay before the timing code because my test system is running Microsoft Security Essentials (MSE), and it (or possibly something else) writes to the disk about 5 seconds after the app is launched, which would otherwise disturb the cycle counts.

Typical results on my P4 Northwood system:
Code:
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         330 cycles
fdiv PC = 53-bit                         294 cycles
fdiv PC = 24-bit                         172 cycles
divsd                                    294 cycles
rcpss+mulss (12-bit precision)           68 cycles
rcpss+mulss extended to 23-bit precision 194 cycles

fsqrt PC = 64-bit                        330 cycles
fsqrt PC = 53-bit                        290 cycles
fsqrt PC = 24-bit                        170 cycles
sqrtsd                                   295 cycles

It looks like the 53-bit precision fdiv and fsqrt might be running on the same execution unit as divsd and sqrtsd.
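For anyone wondering what the "extended to 23-bit precision" rows involve: the usual way to extend a reciprocal estimate is a Newton-Raphson step, x1 = x0 * (2 - d * x0), which roughly doubles the number of correct bits. Below is a minimal numerical sketch of that idea, in Python rather than assembly. Everything in it is an assumption for illustration: `rough_reciprocal` is a hypothetical stand-in that truncates 1/d to 12 mantissa bits (the real rcpss uses a hardware approximation), `f32` emulates single-precision rounding, and the operands 13.2 and 3.0 are made up, not the values used in the attachment.

```python
import math
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def rough_reciprocal(d, bits=12):
    """Hypothetical stand-in for rcpss: 1/d with the mantissa truncated to `bits` bits."""
    mantissa, exponent = math.frexp(1.0 / d)   # 1/d == mantissa * 2**exponent
    scale = 1 << bits
    return f32(math.ldexp(math.trunc(mantissa * scale) / scale, exponent))


def refine(x0, d):
    """One Newton-Raphson step for 1/d, rounded to single precision after each operation."""
    return f32(x0 * f32(2.0 - f32(d * x0)))


def correct_bits(approx, exact):
    """Approximate number of correct bits: -log2 of the relative error."""
    err = abs(approx - exact) / abs(exact)
    return math.inf if err == 0.0 else -math.log2(err)


def approx_divide(a, d, refined=True):
    """Compute a / d as a * (1/d), optionally refining the reciprocal estimate first."""
    x = rough_reciprocal(d)
    if refined:
        x = refine(x, d)
    return f32(f32(a) * x)


if __name__ == "__main__":
    a, d = 13.2, 3.0   # made-up operands, not the ones from the attachment
    x0 = rough_reciprocal(d)
    x1 = refine(x0, d)
    print("estimate bits:", correct_bits(x0, 1.0 / d))
    print("refined bits: ", correct_bits(x1, 1.0 / d))
    print("a/d estimate: ", approx_divide(a, d, refined=False))
    print("a/d refined:  ", approx_divide(a, d))
```

The refinement adds two multiplies and a subtract that all depend on each other, which is consistent with the refined rows taking two to three times as long as the raw estimate in the results above.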
Well Microsoft, here’s another nice mess you’ve gotten us into.

jj2007

  • Member
  • *****
  • Posts: 11440
  • Assembler is fun ;-)
    • MasmBasic
Re: Floating-point divides and square roots
« Reply #1 on: April 19, 2014, 05:02:08 PM »
Celeron M results. And indeed, I have yet to see a CPU that behaves as claimed by the fans of SIMD instructions, i.e. "faster and better than the old obsolete FPU".

I like SSE2+ for other purposes, but it's very difficult to beat the FPU ;-)

Code:
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         290 cycles
fdiv PC = 53-bit                         242 cycles
fdiv PC = 24-bit                         130 cycles
divsd                                    243 cycles
rcpss+mulss (12-bit precision)           50 cycles
rcpss+mulss extended to 23-bit precision 139 cycles

fsqrt PC = 64-bit                        539 cycles
fsqrt PC = 53-bit                        450 cycles
fsqrt PC = 24-bit                        218 cycles
sqrtsd                                   451 cycles

@Siekmanski: Sure?
fdiv PC = 53-bit                         106 cycles
divsd                                    106 cycles

fsqrt PC = 53-bit                        107 cycles
sqrtsd                                   153 cycles


Siekmanski

  • Member
  • *****
  • Posts: 2360
Re: Floating-point divides and square roots
« Reply #2 on: April 19, 2014, 05:03:52 PM »
Results on my system:

OS        : Windows 8.1 (v6.2.9200)
Processor : (12x) Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

Code:
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         196 cycles
fdiv PC = 53-bit                         106 cycles
fdiv PC = 24-bit                         50 cycles
divsd                                    106 cycles
rcpss+mulss (12-bit precision)           73 cycles
rcpss+mulss extended to 23-bit precision 178 cycles

fsqrt PC = 64-bit                        130 cycles
fsqrt PC = 53-bit                        107 cycles
fsqrt PC = 24-bit                        50 cycles
sqrtsd                                   153 cycles

Quote
@Siekmanski: Sure?


Yes, sqrtsd seems to be slow on my system.

Code:
fdiv PC = 64-bit                         142 cycles (varies between 142 and 202 cycles from run to run)
fdiv PC = 53-bit                         106 cycles
fdiv PC = 24-bit                         50 cycles
divsd                                    106 cycles
rcpss+mulss (12-bit precision)           74 cycles
rcpss+mulss extended to 23-bit precision 178 cycles

fsqrt PC = 64-bit                        130 cycles
fsqrt PC = 53-bit                        106 cycles
fsqrt PC = 24-bit                        50 cycles
sqrtsd                                   155 cycles
Creative coders use backward thinking techniques as a strategy.

sinsi

  • Guest
Re: Floating-point divides and square roots
« Reply #3 on: April 19, 2014, 05:16:10 PM »
Win 8.1 x64, i7 3770K
Code:
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         128 cycles
fdiv PC = 53-bit                         98 cycles
fdiv PC = 24-bit                         47 cycles
divsd                                    99 cycles
rcpss+mulss (12-bit precision)           69 cycles
rcpss+mulss extended to 23-bit precision 164 cycles

fsqrt PC = 64-bit                        121 cycles
fsqrt PC = 53-bit                        98 cycles
fsqrt PC = 24-bit                        48 cycles
sqrtsd                                   143 cycles

jj2007

Re: Floating-point divides and square roots
« Reply #4 on: April 19, 2014, 08:16:28 PM »
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
fdiv PC = 64-bit                         266 cycles
fdiv PC = 53-bit                         169 cycles
fdiv PC = 24-bit                         105 cycles
divsd                                    169 cycles
rcpss+mulss (12-bit precision)           59 cycles
rcpss+mulss extended to 23-bit precision 143 cycles

fsqrt PC = 64-bit                        152 cycles
fsqrt PC = 53-bit                        131 cycles
fsqrt PC = 24-bit                        85 cycles
sqrtsd                                   161 cycles


It seems the whole iFamily is affected ;-)

Gunther

  • Member
  • *****
  • Posts: 3723
  • Forgive your enemies, but never forget their names
Re: Floating-point divides and square roots
« Reply #5 on: April 19, 2014, 09:14:17 PM »
Michael,

the results:
Code:
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         194 cycles
fdiv PC = 53-bit                         93 cycles
fdiv PC = 24-bit                         43 cycles
divsd                                    93 cycles
rcpss+mulss (12-bit precision)           65 cycles
rcpss+mulss extended to 23-bit precision 156 cycles

fsqrt PC = 64-bit                        114 cycles
fsqrt PC = 53-bit                        93 cycles
fsqrt PC = 24-bit                        43 cycles
sqrtsd                                   135 cycles

Press any key to continue ...

Gunther
Get your facts first, and then you can distort them.

FORTRANS

  • Member
  • *****
  • Posts: 1095
Re: Floating-point divides and square roots
« Reply #6 on: April 20, 2014, 12:28:14 AM »
Hi Michael,

Press any key to continue ...
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         293 cycles
fdiv PC = 53-bit                         244 cycles
fdiv PC = 24-bit                         131 cycles
divsd                                    246 cycles
rcpss+mulss (12-bit precision)           51 cycles
rcpss+mulss extended to 23-bit precision 140 cycles

fsqrt PC = 64-bit                        545 cycles
fsqrt PC = 53-bit                        455 cycles
fsqrt PC = 24-bit                        220 cycles
sqrtsd                                   456 cycles


Processor   x86 Family 6 Model 13 Stepping 6 GenuineIntel ~1694 Mhz
GenuineIntel
Intel(R) Pentium(R) M processor 1.70GHz


Steve N.

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Floating-point divides and square roots
« Reply #7 on: April 20, 2014, 03:32:30 AM »
P4 Prescott w/HTT, XP SP3
Code:
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         361 cycles
fdiv PC = 53-bit                         319 cycles
fdiv PC = 24-bit                         255 cycles
divsd                                    319 cycles
rcpss+mulss (12-bit precision)           84 cycles
rcpss+mulss extended to 23-bit precision 242 cycles

fsqrt PC = 64-bit                        359 cycles
fsqrt PC = 53-bit                        319 cycles
fsqrt PC = 24-bit                        255 cycles
sqrtsd                                   319 cycles

sinsi

Re: Floating-point divides and square roots
« Reply #8 on: April 20, 2014, 02:13:56 PM »
AMD A10-7850K APU @ 3.70GHz, Windows 7 Ultimate x64
Code:
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.400319099426270
rcpss+mulss extended to 23-bit precision 4.400242328643799

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         117 cycles
fdiv PC = 53-bit                         99 cycles
fdiv PC = 24-bit                         41 cycles
divsd                                    98 cycles
rcpss+mulss (12-bit precision)           9 cycles
rcpss+mulss extended to 23-bit precision 146 cycles

fsqrt PC = 64-bit                        162 cycles
fsqrt PC = 53-bit                        140 cycles
fsqrt PC = 24-bit                        64 cycles
sqrtsd                                   319 cycles

MichaelW

Re: Floating-point divides and square roots
« Reply #9 on: April 20, 2014, 03:10:42 PM »
On the AMD, the rcpss+mulss result differs from the one on the Intel processors tested here (4.400319099426270 versus 4.399655818939209), and the sequence is much faster (9 cycles on the A10-7850K).
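To put numbers on the difference, the reported values can be compared against the divsd result. The sketch below uses only figures copied from the listings in this thread; the helper name `correct_bits` is just for illustration, and the bit counts apply to this one pair of operands, so they say nothing about worst-case accuracy.

```python
import math

# Values copied from the results posted in this thread.
DIVSD = 4.400243013365736          # double-precision reference
FDIV_24 = 4.400242805480957        # fdiv with 24-bit precision control
INTEL_RCPSS = 4.399655818939209    # rcpss+mulss, Intel
AMD_RCPSS = 4.400319099426270      # rcpss+mulss, AMD
AMD_REFINED = 4.400242328643799    # refined sequence, AMD


def correct_bits(approx, reference):
    """Approximate number of correct bits: -log2 of the relative error."""
    return -math.log2(abs(approx - reference) / abs(reference))


intel_bits = correct_bits(INTEL_RCPSS, DIVSD)
amd_bits = correct_bits(AMD_RCPSS, DIVSD)
# Gap between the AMD refined result and the 24-bit fdiv result, in
# single-precision ulps; values in [4, 8) have an ulp of 2**-21.
amd_refined_ulps = (FDIV_24 - AMD_REFINED) / 2.0 ** -21

print(f"Intel rcpss+mulss: about {intel_bits:.1f} correct bits")
print(f"AMD rcpss+mulss:   about {amd_bits:.1f} correct bits")
print(f"AMD refined result is {amd_refined_ulps:.2f} ulp below the 24-bit fdiv result")
```

By this measure the Intel estimate is good to roughly 12.9 bits and the AMD estimate to roughly 15.8 bits for these operands. The AMD refined result appears to land one single-precision ulp below the 24-bit fdiv result, whereas the Intel refined result matches it exactly.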

I was thinking that the slowness of sqrtsd might be a side effect of it sharing circuitry with sqrtpd, or something similar, but when I add a sqrtpd test:
Code:
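    ; Time 8 back-to-back sqrtpd instructions between the counter macros.
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL
        REPEAT 8
            sqrtpd      xmm1, d8    ; packed: two double-precision square roots per instruction
        ENDM
    counter_end
    ; The cycle count is read from eax after counter_end.
    printf("sqrtpd                                   %d cycles\n\n",eax)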

I get 545 cycles for sqrtpd versus 294 cycles for sqrtsd.
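One way to read those two figures is per square root rather than per instruction. The sketch below assumes the sqrtsd timing also covers a REPEAT 8 block (the attachment is not shown here, so that is a guess) and relies on sqrtpd producing two double-precision results per instruction; `cycles_per_root` is a hypothetical helper.

```python
def cycles_per_root(total_cycles, instructions, roots_per_instruction):
    """Average cycles spent per square root computed in the timed block."""
    return total_cycles / (instructions * roots_per_instruction)


# Cycle counts quoted in the post; the instruction count of 8 for sqrtsd is assumed.
sqrtpd_per_root = cycles_per_root(545, 8, 2)   # packed: 2 roots per instruction
sqrtsd_per_root = cycles_per_root(294, 8, 1)   # scalar: 1 root per instruction

print(f"sqrtpd: {sqrtpd_per_root:.2f} cycles per root")
print(f"sqrtsd: {sqrtsd_per_root:.2f} cycles per root")
```

On those assumptions the packed form works out to about 34 cycles per root against about 37 for the scalar form, so on this system sqrtpd behaves roughly like two sqrtsd operations done one after the other.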

Gunther

Re: Floating-point divides and square roots
« Reply #10 on: April 20, 2014, 09:02:36 PM »
Michael,

I've an old AMD box elsewhere. Should I test your application with that?

Gunther

MichaelW

Re: Floating-point divides and square roots
« Reply #11 on: April 20, 2014, 09:58:29 PM »
Quote
Should I test your application with that?

Yes, if it is not too much trouble. While it did not support SSE2, IIRC the Athlon had a faster FPU than the Intel part, a P4 at the time.

Gunther

Re: Floating-point divides and square roots
« Reply #12 on: April 21, 2014, 12:14:39 AM »
Michael,

Quote
Yes, if it is not too much trouble. While it did not support SSE2, IIRC the Athlon had a faster FPU than the Intel part, a P4 at the time.

No big deal. Here are the results with the AMD Athlon X2, Dual-Core TK-57, 1.9 GHz:
Code:
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

c:\tasm\work>test
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.400319099426270
rcpss+mulss extended to 23-bit precision 4.400242328643799

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         169 cycles
fdiv PC = 53-bit                         136 cycles
fdiv PC = 24-bit                         104 cycles
divsd                                    132 cycles
rcpss+mulss (12-bit precision)           49 cycles
rcpss+mulss extended to 23-bit precision 146 cycles

fsqrt PC = 64-bit                        256 cycles
fsqrt PC = 53-bit                        192 cycles
fsqrt PC = 24-bit                        128 cycles
sqrtsd                                   185 cycles

Press any key to continue ...

Gunther

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 8321
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Floating-point divides and square roots
« Reply #13 on: April 21, 2014, 01:58:15 PM »
This is the result on my old 3 GHz Core 2 Quad.

Code:
fdiv PC = 64-bit                         4.400243013365736
fdiv PC = 53-bit                         4.400243013365736
fdiv PC = 24-bit                         4.400242805480957
divsd                                    4.400243013365736
rcpss+mulss (12-bit precision)           4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957

fsqrt PC = 64-bit                        2.330686594117708
fsqrt PC = 53-bit                        2.330686594117708
fsqrt PC = 24-bit                        2.330686569213867
sqrtsd                                   2.330686594117708

fdiv PC = 64-bit                         174 cycles
fdiv PC = 53-bit                         158 cycles
fdiv PC = 24-bit                         90 cycles
divsd                                    157 cycles
rcpss+mulss (12-bit precision)           50 cycles
rcpss+mulss extended to 23-bit precision 138 cycles

fsqrt PC = 64-bit                        171 cycles
fsqrt PC = 53-bit                        147 cycles
fsqrt PC = 24-bit                        90 cycles
sqrtsd                                   154 cycles

Press any key to continue ...
hutch at movsd dot com
http://www.masm32.com