The attachment contains a test of floating-point divides and square roots, using FPU and SSE2 code, with optimizations as detailed in Agner Fog's optimizing_assembly.pdf, available here (http://www.agner.org/optimize/).
I added the long delay before the timing code because my test system is running MSE and it (or possibly something else) is writing to the disk ~5 seconds after the app is launched, and this would otherwise disturb the cycle counts.
Typical results on my P4 Northwood system:
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 330 cycles
fdiv PC = 53-bit 294 cycles
fdiv PC = 24-bit 172 cycles
divsd 294 cycles
rcpss+mulss (12-bit precision) 68 cycles
rcpss+mulss extended to 23-bit precision 194 cycles
fsqrt PC = 64-bit 330 cycles
fsqrt PC = 53-bit 290 cycles
fsqrt PC = 24-bit 170 cycles
sqrtsd 295 cycles
It looks like the 53-bit precision fdiv and fsqrt might be running on the same execution unit as divsd and sqrtsd.
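As a side note on why the "PC = 24-bit" lines print 4.400242805480957 while the 53-bit and 64-bit lines print 4.400243013365736: the x87 precision-control field limits the significand to 24, 53 or 64 bits, and a 24-bit significand is what a single-precision float holds. The following is only a minimal sketch in C, not the test code from the attachment. The operands 5.4321 and 1.2345 are an assumption inferred from the printed quotient, and `round_to_24_bits` is a hypothetical helper name. It also rounds an already-rounded double, which can in rare cases differ from rounding the exact quotient once.

```c
#include <assert.h>

/* Hypothetical helper: round a double to a 24-bit significand by passing it
   through a float. For values within float range this mimics what the x87
   precision-control setting "24-bit" does to the significand of a result. */
double round_to_24_bits(double x) {
    return (double)(float)x;
}

/* Quotient as a 53-bit (double) result and as a 24-bit result.
   The operands passed in are assumptions, see the note above. */
double quotient_53(double a, double b) {
    assert(b != 0.0);
    return a / b;
}

double quotient_24(double a, double b) {
    return round_to_24_bits(quotient_53(a, b));
}
```

Calling `quotient_53(5.4321, 1.2345)` and `quotient_24(5.4321, 1.2345)` and printing both with `%.15f` should show the same kind of split as the fdiv lines above: the two values agree to about seven significant digits and differ after that.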
Celeron M results. And indeed, I have yet to see a CPU that behaves as claimed by the fans of SIMD instructions, i.e. "faster and better than the old obsolete FPU".
I like SSE2+ for other purposes, but it's very difficult to beat the FPU ;-)
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 290 cycles
fdiv PC = 53-bit 242 cycles
fdiv PC = 24-bit 130 cycles
divsd 243 cycles
rcpss+mulss (12-bit precision) 50 cycles
rcpss+mulss extended to 23-bit precision 139 cycles
fsqrt PC = 64-bit 539 cycles
fsqrt PC = 53-bit 450 cycles
fsqrt PC = 24-bit 218 cycles
sqrtsd 451 cycles
@Siekmanski: Sure?
fdiv PC = 53-bit 106 cycles
divsd 106 cycles
fsqrt PC = 53-bit 107 cycles
sqrtsd 153 cycles
Results on my system,
OS : Windows 8.1 (v6.2.9200)
Processor : (12x) Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 196 cycles
fdiv PC = 53-bit 106 cycles
fdiv PC = 24-bit 50 cycles
divsd 106 cycles
rcpss+mulss (12-bit precision) 73 cycles
rcpss+mulss extended to 23-bit precision 178 cycles
fsqrt PC = 64-bit 130 cycles
fsqrt PC = 53-bit 107 cycles
fsqrt PC = 24-bit 50 cycles
sqrtsd 153 cycles
Quote: @Siekmanski: Sure?
Yes, sqrtsd seems to be slow on my system.
fdiv PC = 64-bit 142 cycles (varies between 142 and 202 cycles from run to run)
fdiv PC = 53-bit 106 cycles
fdiv PC = 24-bit 50 cycles
divsd 106 cycles
rcpss+mulss (12-bit precision) 74 cycles
rcpss+mulss extended to 23-bit precision 178 cycles
fsqrt PC = 64-bit 130 cycles
fsqrt PC = 53-bit 106 cycles
fsqrt PC = 24-bit 50 cycles
sqrtsd 155 cycles
Win 8.1 x64, i7 3770K
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 128 cycles
fdiv PC = 53-bit 98 cycles
fdiv PC = 24-bit 47 cycles
divsd 99 cycles
rcpss+mulss (12-bit precision) 69 cycles
rcpss+mulss extended to 23-bit precision 164 cycles
fsqrt PC = 64-bit 121 cycles
fsqrt PC = 53-bit 98 cycles
fsqrt PC = 24-bit 48 cycles
sqrtsd 143 cycles
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
fdiv PC = 64-bit 266 cycles
fdiv PC = 53-bit 169 cycles
fdiv PC = 24-bit 105 cycles
divsd 169 cycles
rcpss+mulss (12-bit precision) 59 cycles
rcpss+mulss extended to 23-bit precision 143 cycles
fsqrt PC = 64-bit 152 cycles
fsqrt PC = 53-bit 131 cycles
fsqrt PC = 24-bit 85 cycles
sqrtsd 161 cycles
It seems the whole iFamily is affected ;-)
Michael,
the results:
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 194 cycles
fdiv PC = 53-bit 93 cycles
fdiv PC = 24-bit 43 cycles
divsd 93 cycles
rcpss+mulss (12-bit precision) 65 cycles
rcpss+mulss extended to 23-bit precision 156 cycles
fsqrt PC = 64-bit 114 cycles
fsqrt PC = 53-bit 93 cycles
fsqrt PC = 24-bit 43 cycles
sqrtsd 135 cycles
Press any key to continue ...
Gunther
Hi Michael,
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 293 cycles
fdiv PC = 53-bit 244 cycles
fdiv PC = 24-bit 131 cycles
divsd 246 cycles
rcpss+mulss (12-bit precision) 51 cycles
rcpss+mulss extended to 23-bit precision 140 cycles
fsqrt PC = 64-bit 545 cycles
fsqrt PC = 53-bit 455 cycles
fsqrt PC = 24-bit 220 cycles
sqrtsd 456 cycles
Processor: x86 Family 6 Model 13 Stepping 6 GenuineIntel ~1694 MHz
Intel(R) Pentium(R) M processor 1.70GHz
Steve N.
P4 Prescott w/HTT, XP SP3
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 361 cycles
fdiv PC = 53-bit 319 cycles
fdiv PC = 24-bit 255 cycles
divsd 319 cycles
rcpss+mulss (12-bit precision) 84 cycles
rcpss+mulss extended to 23-bit precision 242 cycles
fsqrt PC = 64-bit 359 cycles
fsqrt PC = 53-bit 319 cycles
fsqrt PC = 24-bit 255 cycles
sqrtsd 319 cycles
AMD A10-7850K APU @ 3.70GHz, Windows 7 Ultimate x64
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.400319099426270
rcpss+mulss extended to 23-bit precision 4.400242328643799
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 117 cycles
fdiv PC = 53-bit 99 cycles
fdiv PC = 24-bit 41 cycles
divsd 98 cycles
rcpss+mulss (12-bit precision) 9 cycles
rcpss+mulss extended to 23-bit precision 146 cycles
fsqrt PC = 64-bit 162 cycles
fsqrt PC = 53-bit 140 cycles
fsqrt PC = 24-bit 64 cycles
sqrtsd 319 cycles
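On the AMD, the rcpss+mulss results differ from those on the Intel processors tested here (4.400319099426270 versus 4.399655818939209 at 12-bit precision), and the 12-bit sequence is much faster (9 cycles), although the version extended to 23-bit precision is not (146 cycles).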
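The different digits are to be expected: rcpss only promises an approximation (Intel documents a relative error of at most 1.5 * 2^-12), so each vendor's estimate can differ, and that difference carries through the refinement. Below is a minimal sketch in C of the usual way such an estimate is extended towards 23 bits, one Newton-Raphson step r1 = r0 * (2 - x * r0). It is not the code from the attachment. `rcp_estimate_12bit` is a hypothetical stand-in that truncates the exact reciprocal to a 12-bit significand, so its digits will not match either the Intel or the AMD output above, and the operands 5.4321 and 1.2345 used in the usage note are an assumption inferred from the printed results.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for rcpss: the exact reciprocal with its significand
   truncated to 12 bits. Real hardware uses an implementation-specific
   approximation, so this does NOT reproduce any CPU's estimate bit for bit.
   With SSE intrinsics the real instruction is reachable via _mm_rcp_ss. */
float rcp_estimate_12bit(float x) {
    assert(x != 0.0f);
    float r = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFFF000u;   /* keep sign, exponent and the top 11 fraction bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step on f(r) = 1/r - x.
   The relative error of the estimate is roughly squared by each step. */
float rcp_refined(float x) {
    float r0 = rcp_estimate_12bit(x);
    return r0 * (2.0f - x * r0);
}

/* Division as "reciprocal times dividend", the rcpss+mulss pattern. */
float div_estimate(float a, float b) { return a * rcp_estimate_12bit(b); }
float div_refined(float a, float b)  { return a * rcp_refined(b); }
```

With `div_estimate(5.4321f, 1.2345f)` the result is only good to about three decimal places, while `div_refined(5.4321f, 1.2345f)` lands within a few single-precision ulps of 4.400243. The refinement costs two extra multiplies and a subtract that all depend on each other, which fits the much higher cycle counts for the 23-bit lines.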
I was thinking that the slowness of sqrtsd might be a side effect of it sharing circuitry with sqrtpd, or something similar, but now that I add a test:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL
REPEAT 8
sqrtpd xmm1, d8
ENDM
counter_end
printf("sqrtpd %d cycles\n\n",eax)
I get 545 cycles versus 294 cycles for sqrtsd.
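One rough way to read those two numbers is per square root rather than per test. This is only a sketch under stated assumptions: that the sqrtsd test also uses REPEAT 8, and that the reported count covers the whole REPEAT block. `cycles_per_root` is a hypothetical helper, not part of the timing macros.

```c
#include <assert.h>

/* Cycles per individual square root.
   total_cycles : count reported for one pass through the REPEAT block
   instructions : number of instructions in the block (assumed 8 here)
   lanes        : results per instruction (1 for sqrtsd, 2 for sqrtpd) */
double cycles_per_root(double total_cycles, int instructions, int lanes) {
    assert(instructions > 0 && lanes > 0);
    return total_cycles / (instructions * lanes);
}
```

Under those assumptions, `cycles_per_root(294, 8, 1)` gives 36.75 for sqrtsd and `cycles_per_root(545, 8, 2)` gives 34.0625 for sqrtpd. In other words the packed instruction takes nearly twice as long but delivers two results, so the cost per root is about the same.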
Michael,
I've an old AMD box elsewhere. Should I test your application with that?
Gunther
Quote: Should I test your application with that?
Yes, if it is not too much trouble. While it did not support SSE2, IIRC the Athlon had a faster FPU than the Intel part, a P4 at the time.
Michael,
Quote from: MichaelW on April 20, 2014, 09:58:29 PM
Yes, if it is not too much trouble. While it did not support SSE2, IIRC the Athlon had a faster FPU than the Intel part, a P4 at the time.
No big deal. Here are the results with the AMD Athlon X2, Dual-Core TK-57, 1.9 GHz:
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
c:\tasm\work>test
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.400319099426270
rcpss+mulss extended to 23-bit precision 4.400242328643799
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 169 cycles
fdiv PC = 53-bit 136 cycles
fdiv PC = 24-bit 104 cycles
divsd 132 cycles
rcpss+mulss (12-bit precision) 49 cycles
rcpss+mulss extended to 23-bit precision 146 cycles
fsqrt PC = 64-bit 256 cycles
fsqrt PC = 53-bit 192 cycles
fsqrt PC = 24-bit 128 cycles
sqrtsd 185 cycles
Press any key to continue ...
Gunther
This is the result on my old 3 GHz Core 2 Quad.
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 174 cycles
fdiv PC = 53-bit 158 cycles
fdiv PC = 24-bit 90 cycles
divsd 157 cycles
rcpss+mulss (12-bit precision) 50 cycles
rcpss+mulss extended to 23-bit precision 138 cycles
fsqrt PC = 64-bit 171 cycles
fsqrt PC = 53-bit 147 cycles
fsqrt PC = 24-bit 90 cycles
sqrtsd 154 cycles
Press any key to continue ...
In reference to post https://masm32.com/board/index.php?msg=136631
My PC: Processor Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, 3600 MHz, 8 Core(s), 16 Logical Processor(s)
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 23 cycles
fdiv PC = 53-bit 22 cycles
fdiv PC = 24-bit 14 cycles
divsd 21 cycles
rcpss+mulss (12-bit precision) 44 cycles
rcpss+mulss extended to 23-bit precision 123 cycles
fsqrt PC = 64-bit 39 cycles
fsqrt PC = 53-bit 31 cycles
fsqrt PC = 24-bit 17 cycles
sqrtsd 110 cycles
The desktop PC:
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 3.60 GHz
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 27 cycles
fdiv PC = 53-bit 23 cycles
fdiv PC = 24-bit 16 cycles
divsd 23 cycles
rcpss+mulss (12-bit precision) 52 cycles
rcpss+mulss extended to 23-bit precision 136 cycles
fsqrt PC = 64-bit 43 cycles
fsqrt PC = 53-bit 37 cycles
fsqrt PC = 24-bit 17 cycles
sqrtsd 126 cycles
And the laptop:
Intel(R) Celeron(R) N5105 @ 2.00GHz 2.00 GHz
fdiv PC = 64-bit 4.400243013365736
fdiv PC = 53-bit 4.400243013365736
fdiv PC = 24-bit 4.400242805480957
divsd 4.400243013365736
rcpss+mulss (12-bit precision) 4.399655818939209
rcpss+mulss extended to 23-bit precision 4.400242805480957
fsqrt PC = 64-bit 2.330686594117708
fsqrt PC = 53-bit 2.330686594117708
fsqrt PC = 24-bit 2.330686569213867
sqrtsd 2.330686594117708
fdiv PC = 64-bit 47 cycles
fdiv PC = 53-bit 42 cycles
fdiv PC = 24-bit 24 cycles
divsd 42 cycles
rcpss+mulss (12-bit precision) 42 cycles
rcpss+mulss extended to 23-bit precision 106 cycles
fsqrt PC = 64-bit 76 cycles
fsqrt PC = 53-bit 66 cycles
fsqrt PC = 24-bit 34 cycles
sqrtsd 94 cycles