Ciao, jj2007!
The new year is almost here, so I'll flood a little.

It would seem that nothing could be simpler than fsqrt, but we'll go a different way.
1) The best-known integer algorithm for calculating the square root of a number is striking in its simplicity:
unsigned sqrt_cpu_int(unsigned X)
{ unsigned div = 1, result = 0;
  while (X >= div)                      // subtract successive odd numbers 1, 3, 5, ...
  { X -= div; div += 2; result += 1; }  // each successful subtraction adds 1 to the root
  return result;
}
The disadvantage of this algorithm is that the number of iterations grows with X (it is proportional to √X).
2) Calculation of the square root via a Taylor series expansion.
Let
  X be any number;
  f(X) a function of X;
  a a known number close to X;
  f(a) the known value of the function at a.
We expand f(X) in a Taylor series:
  f(X) = f(a) + (X-a)∙f'(a) + (X-a)²∙f"(a)/2! + ... + (X-a)ⁿ∙f⁽ⁿ⁾(a)/n!
Now let X be the number whose root we want to extract. Then f(X) = √X, a is a known number close to X, and f(a) = √a is a known value close to √X. Keeping only the first-order term gives
  f(X) ≈ √a + (X-a)/(2√a) = (2a + X - a)/(2√a) = (a + X)/(2√a)
The value √X is found by choosing √a and then calculating f(X). f(X)² can be compared with the original number X; if the accuracy is insufficient, a is replaced by f(X)², √a by f(X), and the calculation is repeated.
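As a rough illustration (my own sketch, not from the original description: the starting guess, the iteration cap and the tolerance are arbitrary choices), the iteration can be written in C like this:

#include <math.h>   // only for fabs() in the stopping test

double sqrt_taylor(double X)
{ double sqrt_a = X > 1.0 ? X / 2.0 : 1.0;  // rough initial guess for sqrt(X)
  double a = sqrt_a * sqrt_a;               // a number whose root we know exactly
  double fX = sqrt_a;
  for (int i = 0; i < 60; i++)
  { fX = (a + X) / (2.0 * sqrt_a);          // f(X) = (a + X) / (2*sqrt(a))
    if (fabs(fX * fX - X) <= 1e-12 * X)     // compare f(X)^2 with the original X
        break;
    a = fX * fX;                            // replace a by f(X)^2 ...
    sqrt_a = fX;                            // ... and sqrt(a) by f(X), then repeat
  }
  return fX;
}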
3) The search for an integer square root by Newton's method begins with some value g0, the initial estimate of √X. Then a series of refinements of the root is performed by the formula g(n+1) = (g(n) + X/g(n))/2. To reduce the number of iterations, you can, in the first step, select a good initial value for the variable g0:
unsigned sqrt_newton(unsigned x)
{ unsigned x1;
  int a, g0, g1;
  if (x <= 1) return x;
  a = 1;
  x1 = x - 1;
  if (x1 > 65535) { a = a + 8; x1 = x1 >> 16; }
  if (x1 > 255)   { a = a + 4; x1 = x1 >> 8; }
  if (x1 > 15)    { a = a + 2; x1 = x1 >> 4; }
  if (x1 > 3)     { a = a + 1; }
  g0 = 1 << a;                 // g0 = 2^a, the initial estimate of sqrt(x)
  g1 = (g0 + (x >> a)) >> 1;   // first Newton step (x >> a is x/g0)
  while (g1 < g0)              // repeat while the estimates strictly decrease
  { g0 = g1; g1 = (g0 + (x / g0)) >> 1; }
  return g0;
}
4) Since division is a fairly slow operation, we can do without it and compute the square root using multiplication instead. A 32-bit number X has a 16-bit square root. At each step, one bit of the root value is refined; for speed, adding or subtracting the value of the corresponding bit is done "without jumps" (branchless):
mov ebx, 4000h
mov ecx, 8000h
mov eax, 40000000h
@@: cmp eax, X
je @f; early exit from the loop
sbb edx, edx; if CF = 1 then edx = 0FFFFFFFFh otherwise edx = 0
sub ecx, ebx
and edx, ebx; if CF = 1 then edx = ebx otherwise edx = 0
lea ecx, [ecx + edx * 2]; if CF = 1 then ecx = ecx + ebx otherwise ecx = ecx-ebx
shr ebx, 1; move to the next level
mov eax, ecx
mul eax; we get "eax squared"
test ebx, ebx
jnz @b
@@: mov eax, ecx; eax = sqrt(X)
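For comparison, here is the same multiplication-based loop rendered in C (my own translation, with an illustrative name; the explicit if/else is what the sbb/and/lea sequence above does without branches):

unsigned sqrt_by_mul(unsigned X)
{ unsigned bit  = 0x4000;          // ebx: the bit of the root being refined
  unsigned root = 0x8000;          // ecx: current estimate of sqrt(X)
  unsigned sq   = 0x40000000;      // eax: root * root
  while (bit)
  { if (sq == X) break;            // early exit, as in the asm
    if (sq < X) root += bit;       // estimate too small: set the bit
    else        root -= bit;       // estimate too large: take it back down
    bit >>= 1;                     // move to the next bit
    sq = root * root;              // "eax squared"
  }
  return root;                     // root = sqrt(X)
}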
It is not obvious how it manages with neither division nor multiplication, but it works! And the algorithm was described as far back as 1945, in J. von Neumann's "First Draft of a Report on the EDVAC":
mov ebp, X          ; ebp = remainder
mov ecx, 40000000h  ; start from the highest bit pair
xor eax, eax        ; eax = root, built one bit per pass
@@: mov ebx, eax
or ebx, ecx         ; ebx = trial value: root | bit
shr eax, 1
cmp ebp, ebx
sbb edx, edx
not edx             ; edx = 0FFFFFFFFh if remainder >= trial, otherwise 0
mov edi, edx
and edx, ebx        ; edx = trial if it fits, otherwise 0
and edi, ecx        ; edi = bit if it fits, otherwise 0
sub ebp, edx        ; subtract the trial value from the remainder
or eax, edi         ; and set that bit of the root
shr ecx, 2          ; next bit pair
test ecx, ecx
jnz @b
;eax=sqrt(X)
Or
mov ebp, X
inc ebp                  ; work with X+1, so "below" later means trial <= remainder
bsr ecx, ebp             ; index of the highest set bit
and ecx, -2              ; round it down to an even position
mov ebx, 1
shl ebx, cl              ; starting bit, chosen to reduce the number of iterations
xor eax, eax             ; eax = root
@@: lea ecx, [eax + ebx] ; ecx = trial value: root + bit
shr eax, 1
cmp ecx, ebp
sbb edx, edx             ; edx = 0FFFFFFFFh if trial <= remainder, otherwise 0
mov edi, edx
and edx, ecx             ; edx = trial if it fits, otherwise 0
sub ebp, edx             ; subtract it from the remainder
and edi, ebx             ; edi = bit if it fits, otherwise 0
or eax, edi              ; and set that bit of the root
shr ebx, 2               ; next bit pair; ZF set when the bit shifts out
jnz @b
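For reference, here is the first of these two variants rewritten in C (my own sketch, with an illustrative name), which makes the digit-by-digit structure easier to see; the asm replaces the if block with the branchless sbb/not/and/or sequence:

unsigned isqrt_edvac(unsigned x)
{ unsigned rem  = x;               // ebp: remainder
  unsigned root = 0;               // eax: the result, built one bit at a time
  unsigned bit  = 1u << 30;        // ecx: 40000000h, start from the highest bit pair
  while (bit)
  { unsigned trial = root | bit;   // ebx = eax or ecx
    root >>= 1;                    // shr eax, 1
    if (rem >= trial)
    { rem  -= trial;               // sub ebp, edx
      root |= bit;                 // or eax, edi
    }
    bit >>= 2;                     // shr ecx, 2: next bit pair
  }
  return root;                     // eax = sqrt(X), truncated
}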
Maybe these algorithms will turn out to be faster than fsqrt with its 5703 cycles for 100?