Normally, the FPU is not slower than equivalent SIMD code, but for 1/x, rcpss beats fdiv:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3168 cycles for 1000 * rcpss
13049 cycles for 1000 * fdiv
3167 cycles for 1000 * rcpss
13135 cycles for 1000 * fdiv
3181 cycles for 1000 * rcpss
13259 cycles for 1000 * fdiv
3189 cycles for 1000 * rcpss
13092 cycles for 1000 * fdiv
3156 cycles for 1000 * rcpss
13070 cycles for 1000 * fdiv
24 bytes for rcpss
23 bytes for fdiv
ST0 123453440.0000000000
ST0 123456792.0000000000
Of course, precision is lower; the expected value is 123456789.0
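To put the two printed values in perspective, here is a small Python sketch. It is not part of the testbed, and the helper names `to_float32` and `relative_error` are made up for illustration. It computes the relative error of each result against 123456789 and checks which REAL4 value lies nearest to the expected one.

```python
import struct

EXPECTED = 123456789.0
RCPSS_RESULT = 123453440.0   # value printed after the rcpss loop
FDIV_RESULT = 123456792.0    # value printed after the fdiv loop


def to_float32(x):
    """Round a Python float to the nearest IEEE-754 single (REAL4)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def relative_error(value, expected=EXPECTED):
    """Absolute difference divided by the expected value."""
    return abs(value - expected) / expected


if __name__ == "__main__":
    print(f"nearest REAL4 to expected: {to_float32(EXPECTED):.1f}")
    print(f"rcpss relative error: {relative_error(RCPSS_RESULT):.2e}")
    print(f"fdiv  relative error: {relative_error(FDIV_RESULT):.2e}")
```

The rcpss chain ends up roughly 2.7e-5 away from the expected value, while the fdiv figure is just 123456789 rounded to single precision.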
The source:
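NameA equ rcpss ; assign a descriptive name here
TestA proc
  mov ebx, AlgoLoops-1 ; loop 1000x
  push 123456789
  fild stack ; load the integer from [esp]
  fstp stack ; store it back as REAL4
  pop eax
  movd xmm0, eax ; xmm0 = 123456789.0 (single precision)
  align 4
  .Repeat
    rcpss xmm0, xmm0 ; approximate 1/x, fed back into itself
    dec ebx
  .Until Sign?
  movd eax, xmm0
  ret
TestA endp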
align_64
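NameB equ fdiv ; assign a descriptive name here
TestB proc
  mov ebx, AlgoLoops-1 ; loop 1000x
  push 123456789
  fild stack ; load the integer from [esp]
  fstp stack ; store it back as REAL4
  fld1 ; keep 1.0 on the FPU stack for the loop
  align 4
  .Repeat
    fld stack ; ST(0) = x, ST(1) = 1.0
    fdiv ST, ST(1) ; ST(0) = ST(0)/ST(1), i.e. x/1.0 as written; fdivr would give 1.0/x
    fstp stack
    dec ebx
  .Until Sign?
  fstp st ; pop the 1.0
  pop eax
  ret
TestB endp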
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
3868 cycles for 1000 * rcpss
15813 cycles for 1000 * fdiv
3868 cycles for 1000 * rcpss
15826 cycles for 1000 * fdiv
3874 cycles for 1000 * rcpss
15808 cycles for 1000 * fdiv
3875 cycles for 1000 * rcpss
15808 cycles for 1000 * fdiv
3868 cycles for 1000 * rcpss
15821 cycles for 1000 * fdiv
24 bytes for rcpss
23 bytes for fdiv
ST0 123453440.0000000000
ST0 123456792.0000000000
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G (SSE4)
2564 cycles for 1000 * rcpss
17517 cycles for 1000 * fdiv
2552 cycles for 1000 * rcpss
16796 cycles for 1000 * fdiv
2822 cycles for 1000 * rcpss
16735 cycles for 1000 * fdiv
2566 cycles for 1000 * rcpss
16124 cycles for 1000 * fdiv
2636 cycles for 1000 * rcpss
16808 cycles for 1000 * fdiv
24 bytes for rcpss
23 bytes for fdiv
ST0 123453440.0000000000
ST0 123456792.0000000000
--- ok ---
1.60 GHz as usual
Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz (SSE4)
928 cycles for 1000 * rcpss
12260 cycles for 1000 * fdiv
903 cycles for 1000 * rcpss
12142 cycles for 1000 * fdiv
889 cycles for 1000 * rcpss
12072 cycles for 1000 * fdiv
893 cycles for 1000 * rcpss
12114 cycles for 1000 * fdiv
894 cycles for 1000 * rcpss
12059 cycles for 1000 * fdiv
24 bytes for rcpss
23 bytes for fdiv
ST0 123453440.0000000000
ST0 123456792.0000000000
--- ok ---
It shouldn't be a surprise, since you're executing three instructions per loop iteration for the FPU versus one instruction for SIMD.
Hi, jj2007!
fild stack
At first I was even delighted by this non-standard way of referring to the top of the FPU stack, but
tut_02.asm(8) : error A2006:undefined symbol : stack
signed equ sdword ptr
stack equ <DWord Ptr [esp]>
push 9 ; 10 iterations
.Repeat
  ... do something ...
  dec stack
.Until Sign? || signed eax<0
pop edx
@JimG: The fdiv makes it slow. It seems rcpss uses a much faster algorithm.
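As a rough illustration of how big the gap is, the sketch below divides the posted cycle counts by the 1000 loop iterations. It uses only the first timing line of each CPU posted above; the function names are hypothetical, and the per-iteration figures include the loop overhead (and, for the FPU loop, the load and store).

```python
ALGO_LOOPS = 1000  # iterations per timing, as in "1000 * rcpss"

# First timing line of each CPU posted above: (rcpss cycles, fdiv cycles)
FIRST_RUNS = {
    "i5-2450M": (3168, 13049),
    "i7-4930K": (3868, 15813),
    "A6-9220e": (2564, 17517),
    "i3 540": (928, 12260),
}


def per_iteration(total_cycles, loops=ALGO_LOOPS):
    """Average cycles spent in one pass through the loop."""
    return total_cycles / loops


def summarize(runs=FIRST_RUNS):
    """Map CPU name to (rcpss cycles/iter, fdiv cycles/iter, fdiv/rcpss ratio)."""
    rows = {}
    for cpu, (rcpss, fdiv) in runs.items():
        rows[cpu] = (per_iteration(rcpss), per_iteration(fdiv), fdiv / rcpss)
    return rows


if __name__ == "__main__":
    for cpu, (r, f, ratio) in summarize().items():
        print(f"{cpu:10s} rcpss {r:6.2f}  fdiv {f:6.2f}  fdiv/rcpss {ratio:5.1f}x")
```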
Hi, jj2007!
stack equ <DWord Ptr [esp]>
Many thanks!
Why don't we compare against rcpps and divps (or divpd)? I think that would give more useful results.
Quote from: daydreamer on June 24, 2018, 08:41:32 PM
Why don't we compare against rcpps and divps (or divpd)? I think that would give more useful results.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3238 cycles for 1000 * rcpss
13094 cycles for 1000 * 1/x using fdiv
10453 cycles for 1000 * 1/x using divss
3157 cycles for 1000 * rcpss
13125 cycles for 1000 * 1/x using fdiv
10602 cycles for 1000 * 1/x using divss
3151 cycles for 1000 * rcpss
13064 cycles for 1000 * 1/x using fdiv
10644 cycles for 1000 * 1/x using divss
24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss
ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000
movd xmm2, FP4(1.0)
align 4
.Repeat
  movaps xmm1, xmm2 ; reload 1.0
  divss xmm1, xmm0 ; divide by 123456789
  dec ebx
.Until Sign?
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
3864 cycles for 1000 * rcpss
15808 cycles for 1000 * 1/x using fdiv
6211 cycles for 1000 * 1/x using divss
3865 cycles for 1000 * rcpss
15789 cycles for 1000 * 1/x using fdiv
6268 cycles for 1000 * 1/x using divss
3866 cycles for 1000 * rcpss
15797 cycles for 1000 * 1/x using fdiv
6190 cycles for 1000 * 1/x using divss
24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss
ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
2584 cycles for 1000 * rcpss
16274 cycles for 1000 * 1/x using fdiv
1818 cycles for 1000 * 1/x using divss
2597 cycles for 1000 * rcpss
16282 cycles for 1000 * 1/x using fdiv
1818 cycles for 1000 * 1/x using divss
2587 cycles for 1000 * rcpss
16274 cycles for 1000 * 1/x using fdiv
1818 cycles for 1000 * 1/x using divss
24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss
ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000
--- ok ---
1.60 GHz, as usual
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G (SSE4)
2353 cycles for 1000 * rcpss
15288 cycles for 1000 * 1/x using fdiv
2070 cycles for 1000 * 1/x using divss
2352 cycles for 1000 * rcpss
15347 cycles for 1000 * 1/x using fdiv
2071 cycles for 1000 * 1/x using divss
2353 cycles for 1000 * rcpss
15283 cycles for 1000 * 1/x using fdiv
2069 cycles for 1000 * 1/x using divss
24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss
ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000
--- ok ---
Quote from: daydreamer on June 24, 2018, 08:41:32 PM
Why dont we...
Why don't you ... post your results? :P
The more, the merrier. :biggrin:
Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz (SSE4)
910 cycles for 1000 * rcpss
12488 cycles for 1000 * 1/x using fdiv
12446 cycles for 1000 * 1/x using divss
880 cycles for 1000 * rcpss
12217 cycles for 1000 * 1/x using fdiv
12854 cycles for 1000 * 1/x using divss
1029 cycles for 1000 * rcpss
12393 cycles for 1000 * 1/x using fdiv
12029 cycles for 1000 * 1/x using divss
24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss
ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000
--- ok ---
These results vary widely.
fdiv is definitely out, but it's a toss-up between rcpss and divss.
Quote from: zedd151 on June 25, 2018, 03:05:28 AM
fdiv is definitely out
Not for Yuri's i3: rcpss is 1/x only, while fdiv and divss are used for generic division. But I agree that on other CPUs divss is faster. Whether it matters is another question: do you have an innermost loop with a million iterations that needs a division and can live with low precision?
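To see how much the picture changes from CPU to CPU, the following sketch tabulates the three-way results posted up to this point (first timing row of each CPU only). The helper names are invented for illustration; the numbers are copied from the posts above.

```python
NAMES = ("rcpss", "fdiv", "divss")

# First timing row per CPU: cycles for 1000 iterations of (rcpss, fdiv, divss)
RESULTS = {
    "i5-2450M": (3238, 13094, 10453),
    "i7-4930K": (3864, 15808, 6211),
    "i7-6700": (2584, 16274, 1818),
    "A6-9220e": (2353, 15288, 2070),
    "i3 540": (910, 12488, 12446),
}


def fastest(cycles):
    """Name of the variant with the lowest cycle count."""
    return NAMES[cycles.index(min(cycles))]


def divss_vs_rcpss(cycles):
    """How many times slower (>1) or faster (<1) divss is than rcpss."""
    rcpss, _, divss = cycles
    return divss / rcpss


if __name__ == "__main__":
    for cpu, cycles in RESULTS.items():
        print(f"{cpu:10s} fastest: {fastest(cycles):5s}  "
              f"divss/rcpss: {divss_vs_rcpss(cycles):5.2f}")
```

One caveat when reading these ratios: the rcpss loop feeds xmm0 back into itself, so each iteration depends on the previous one, while the divss loop reloads 1.0 and divides by an unchanged xmm0 every time. The two loops may therefore not be measuring quite the same thing, which could account for part of the spread.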
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
2838 cycles for 1000 * rcpss
12369 cycles for 1000 * 1/x using fdiv
4554 cycles for 1000 * 1/x using divss
2852 cycles for 1000 * rcpss
12214 cycles for 1000 * 1/x using fdiv
4574 cycles for 1000 * 1/x using divss
2873 cycles for 1000 * rcpss
13037 cycles for 1000 * 1/x using fdiv
4570 cycles for 1000 * 1/x using divss
24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss
ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000
--- ok ---
Cut and paste from screen.
F:\TEMP\TEST>1_DIV_X
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
2088 cycles for 1000 * rcpss
13466 cycles for 1000 * fdiv
2085 cycles for 1000 * rcpss
13432 cycles for 1000 * fdiv
2087 cycles for 1000 * rcpss
13449 cycles for 1000 * fdiv
2050 cycles for 1000 * rcpss
13485 cycles for 1000 * fdiv
2083 cycles for 1000 * rcpss
13454 cycles for 1000 * fdiv
24 bytes for rcpss
23 bytes for fdiv
ST0 123453440.0000000000
ST0 123456792.0000000000
--- ok ---
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
3732 cycles for 1000 * rcpss
15936 cycles for 1000 * fdiv
3736 cycles for 1000 * rcpss
15921 cycles for 1000 * fdiv
3738 cycles for 1000 * rcpss
15888 cycles for 1000 * fdiv
3762 cycles for 1000 * rcpss
15983 cycles for 1000 * fdiv
3762 cycles for 1000 * rcpss
15903 cycles for 1000 * fdiv
24 bytes for rcpss
23 bytes for fdiv
ST0 123453440.0000000000
ST0 123456792.0000000000
--- ok ---
Output redirected to file.
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
2246 cycles for 1000 * rcpss
13534 cycles for 1000 * 1/x using fdiv
16239 cycles for 1000 * 1/x using divss
2078 cycles for 1000 * rcpss
13686 cycles for 1000 * 1/x using fdiv
16026 cycles for 1000 * 1/x using divss
2482 cycles for 1000 * rcpss
13335 cycles for 1000 * 1/x using fdiv
16349 cycles for 1000 * 1/x using divss
24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss
ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000
--- ok ---
F:\TEMP\TEST>1_DIV_X
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
3770 cycles for 1000 * rcpss
15923 cycles for 1000 * 1/x using fdiv
5992 cycles for 1000 * 1/x using divss
3768 cycles for 1000 * rcpss
15919 cycles for 1000 * 1/x using fdiv
5970 cycles for 1000 * 1/x using divss
3770 cycles for 1000 * rcpss
15932 cycles for 1000 * 1/x using fdiv
5970 cycles for 1000 * 1/x using divss
24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss
ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000
--- ok ---
Intel(R) Celeron(R) CPU N3350 @ 1.10GHz (SSE4)
1433 cycles for 1000 * rcpss
22803 cycles for 1000 * 1/x using fdiv
8498 cycles for 1000 * 1/x using divss
1411 cycles for 1000 * rcpss
21311 cycles for 1000 * 1/x using fdiv
8264 cycles for 1000 * 1/x using divss
1420 cycles for 1000 * rcpss
22523 cycles for 1000 * 1/x using fdiv
8388 cycles for 1000 * 1/x using divss
24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss
ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000
--- ok ---