Normally, FPU code is no slower than the equivalent SIMD code, but for the reciprocal 1/x, rcpss beats fdiv by about a factor of four:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3168 cycles for 1000 * rcpss
13049 cycles for 1000 * fdiv
3167 cycles for 1000 * rcpss
13135 cycles for 1000 * fdiv
3181 cycles for 1000 * rcpss
13259 cycles for 1000 * fdiv
3189 cycles for 1000 * rcpss
13092 cycles for 1000 * fdiv
3156 cycles for 1000 * rcpss
13070 cycles for 1000 * fdiv
24 bytes for rcpss
23 bytes for fdiv
ST0 123453440.0000000000
ST0 123456792.0000000000
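Of course, the rcpss result (first line) is less precise: rcpss is only an approximation, documented by Intel with a relative error of at most 1.5*2^-12, and here it agrees with the expected value to only five significant digits. The expected value is 123456789.0, which single precision rounds to 123456792.0.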
The source:
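NameA equ rcpss ; assign a descriptive name here
TestA proc
  mov ebx, AlgoLoops-1 ; loop 1000x
  push 123456789 ; test value as an integer
  fild stack ; load it into the FPU ...
  fstp stack ; ... and store it back as a single-precision float
  pop eax
  movd xmm0, eax ; xmm0 = 123456789.0 as REAL4
  align 4
  .Repeat
    rcpss xmm0, xmm0 ; xmm0 = approx. 1/xmm0
    dec ebx
  .Until Sign?
  movd eax, xmm0 ; return the result bits in eax
  ret
TestA endp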
align_64
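NameB equ fdiv ; assign a descriptive name here
TestB proc
  mov ebx, AlgoLoops-1 ; loop 1000x
  push 123456789 ; test value as an integer
  fild stack ; load it into the FPU ...
  fstp stack ; ... and store it back as a single-precision float
  fld1 ; ST(0) = 1.0, stays on the FPU stack during the loop
  align 4
  .Repeat
    fld stack ; ST(0) = x, ST(1) = 1.0
    fdiv ST, ST(1) ; ST(0) = ST(0)/ST(1), i.e. x/1.0; fdivr ST, ST(1) would give 1.0/x
    fstp stack ; store the result back and pop it
    dec ebx
  .Until Sign?
  fstp st ; pop the 1.0
  pop eax ; return the result bits in eax
  ret
TestB endp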
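If the lower precision of rcpss is a problem, a common remedy is one Newton-Raphson step, x1 = 2*x0 - d*x0*x0, which roughly doubles the number of correct bits. The following is only a sketch: RcpNR is a hypothetical helper that is not part of the test above, it assumes the input d arrives in xmm0, and it was not timed here, so how much of the speed advantage over fdiv survives is open.

```asm
; Hypothetical helper, not from the original test code.
; In:  xmm0 = d (REAL4 in the low dword)
; Out: xmm0 = refined approximation of 1/d; xmm1 is overwritten
RcpNR proc
  rcpss xmm1, xmm0    ; x0 = approx. 1/d (about 12 bits)
  mulss xmm0, xmm1    ; d*x0
  mulss xmm0, xmm1    ; d*x0*x0
  addss xmm1, xmm1    ; 2*x0
  subss xmm1, xmm0    ; x1 = 2*x0 - d*x0*x0
  movaps xmm0, xmm1   ; return x1 in xmm0
  ret
RcpNR endp
```

The refinement adds two multiplications, an addition and a subtraction, all of which depend on the rcpss result, so in a latency-bound loop like TestA each iteration gets correspondingly longer.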