The MASM Forum
General => The Laboratory => Topic started by: jj2007 on December 30, 2017, 09:25:57 PM

Category "almost unknown SIMD instructions" ;)
Intel(R) Core(TM) i52450M CPU @ 2.50GHz (SSE4)
5703 cycles for 100 * fsqrt
4380 cycles for 100 * rsqrtss
5684 cycles for 100 * fsqrt
4436 cycles for 100 * rsqrtss
5683 cycles for 100 * fsqrt
4435 cycles for 100 * rsqrtss
5646 cycles for 100 * fsqrt
4420 cycles for 100 * rsqrtss
5699 cycles for 100 * fsqrt
4396 cycles for 100 * rsqrtss
87 bytes for fsqrt
97 bytes for rsqrtss
9483.79199 = fsqrt
9483.57227 = rsqrtss
RSQRTSSScalar SinglePrecision FloatingPoint Square Root Reciprocal
Computes an approximate reciprocal of the square root of the low singleprecision floatingpoint value in the source operand (second operand), stores the singleprecision floatingpoint result in the destination operand. The maximum error for this approximation is (1.5 * 212). The source operand can be an XMM register or a 32bit memory location. The destination operand is an XMM register.

Ciao, jj2007!
There is soon a new year, so I'll a little flood (https://wasm.in/styles/smiles_s/smile3.gif)
It would seem that it could be simpler than fsqrt, but we'll go the other way
1) The bestknown integer algorithm for calculating the square root of a number is striking in its simplicity
unsigned sqrt_cpu_int(usigned X)
{ unsigned div = 1, result = 0;
while (X > 0)
{ X = div; div += 2; result += X < 0 ? 0 : 1; }
return result;
}
The disadvantage of this algorithm is that the number of iterations will increase with increasing X
2) calculation of the square root of the Taylor series expansion.
Let X be any number; f(X) is a function depending on X; a is a known number close to X; f(a) is a known value of the function.
We expand f(X) in a Taylor series:
f(X)=f(a)+(Xa)f'(a)+((Xa)²∙f"(a))/2!+...+((Xa)^{n}∙f^{n}(a))/n!
Let X be the number from which to extract the root. Then f(X)=√X; a is a known number close to X; f(a)=√a is a known number close to √X, and f(X)=√a+(Xa)/(2√a)+...=(2a+Xa)/(2√a)=(a+X)/(2√a)
The value √X can be found by setting √a and then calculating f(X). f(X)² can be compared to the original number X. If the accuracy is insufficient, then the number a is replaced by f(X)², √a by f(X) and the calculation is repeated
3) the search for an integer square root by Newton's method begins with a certain value of g_{0}, which is the initial estimate √X. Then a series of refinements of the value of the square root is performed by the formula g_{n+1} = (g_{n} + X/g_{n})/2. To reduce the number of iterations, you can, in the first step, select the initial value for the variable g_{0}{usigned x1;
int a, g0, g1;
if (x <= 1) return x;
a = 1;
x1 = x1;
if (x1> 65535) {a = a + 8; x1 = x1 >> 16;}
if (x1> 255) {a = a + 4; x1 = x1 >> 8;}
if (x1> 15) {a = a + 2; x1 = x1 >> 4;}
if (x1> 3) {a = a + 1;}
g0 = 1 << a; // g0 = 2 //
g1 = (g0 + (x >> a)) >> 1;
while (g1 <g0) // we repeat, while the approximations are strictly reduced
{g0 = g1; g1 = (g0 + (x/g0)) >> 1}
}
4) Since the division is a fairly slow operation, you can abandon it. To calculate the square root, we use multiplication. A 32bit X number will have a 16bit square root. At each step, the 1 bit of the root value is refined. For the acceleration, the calculation "without jumps" is made to add or subtract the value in the corresponding digit mov ebx, 4000h
mov ecx, 8000h
mov eax, 40000000h
@@: cmp eax, X
je @f; premature exit from the cycle
sbb edx, edx; if CF = 1 then edx = 0FFFFFFFFh otherwise edx = 0
sub ecx, ebx
and edx, ebx; if CF = 1 then edx = ebx otherwise edx = 0
lea ecx, [ecx + edx * 2]; if CF = 1 then ecx = ecx + ebx otherwise ecx = ecxebx
shr ebx, 1; move to the next level
mov eax, ecx
mul eax; we get "eax squared"
test ebx, ebx
jnz @b
@@: mov eax, ecx; eax: = sqrt (X)
It is not clear how, neither division, nor multiplication, but it works! And the algorithm was described already in 1945 in Von Neumann J. "First Draft of a Reaport on the EDVAC" mov ebp,X
mov ecx,40000000h
xor eax,eax
@@: mov ebx,eax
or ebx,ecx
shr eax,1
cmp ebp,ebx
sbb edx,edx
not edx
mov edi,edx
and edx,ebx
and edi,ecx
sub ebp,edx
or eax,edi
shr ecx,2
test ecx,ecx
jnz @b
;eax=sqrt(X)
Ormov ebp, X
inc ebp
bsr ecx, ebp
and ecx, 2
mov ebx, 1
shl ebx, cl; to reduce the number of iterations
xor eax, eax
@@: lea ecx, [eax + ebx]
shr eax, 1
cmp ecx, ebp
sbb edx, edx
mov edi, edx
and edx, ecx
sub ebp, edx
and edi, ebx
or eax, edi
shr ebx, 2
jnz @b
maybe these algorithms will be faster than fsqrt with 5703 cycles for 100

Hi Mikl__,
You have set out a very good overview of possible alternative algorithms. :t Here (https://www.codeproject.com/Articles/69941/BestSquareRootMethodAlgorithmFunctionPrecisi) you will find an interesting article on these questions. This is done in the high level language, but it should not be a problem to realize that in assembly language.
Gunther

If you come up with a convincing MASM algo, I can add it :bgrin:
(I doubt it will be any faster  if it's a good algo, the Intel guys realised it in microops)

Hallo, Gunther!
Danke für den interessanten Artikel! Ich werde definitiv in AssemblerSprache übersetzen und die empfangenen Programme auf Geschwindigkeit prüfen. Glückliches neues Jahr!

Mikl__,
Желаю вам хорошего здоровья и счастья в следующем году.
Gunther

JJ,
Why don't you mention SQRTSS or SQRTPS? Do you prefer reciprocals? :shock:

Why don't you mention SQRTSS or SQRTPS? Do you prefer reciprocals? :shock:
Good idea:Intel(R) Core(TM) i52450M CPU @ 2.50GHz (SSE4)
5536 cycles for 100 * fsqrt
4300 cycles for 100 * rsqrtss
4638 cycles for 100 * sqrtss
5639 cycles for 100 * sqrtps
5532 cycles for 100 * fsqrt
4301 cycles for 100 * rsqrtss
4638 cycles for 100 * sqrtss
5641 cycles for 100 * sqrtps
5538 cycles for 100 * fsqrt
4301 cycles for 100 * rsqrtss
4640 cycles for 100 * sqrtss
5674 cycles for 100 * sqrtps
5555 cycles for 100 * fsqrt
4304 cycles for 100 * rsqrtss
4640 cycles for 100 * sqrtss
5709 cycles for 100 * sqrtps
5548 cycles for 100 * fsqrt
4305 cycles for 100 * rsqrtss
4641 cycles for 100 * sqrtss
5675 cycles for 100 * sqrtps
87 bytes for fsqrt
97 bytes for rsqrtss
89 bytes for sqrtss
87 bytes for sqrtps
9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps
The last result looks wrong, though ::)
Rand(0.0, 123456789.0, MyReal4)
sqrtps xmm0, oword ptr MyReal4
movaps oword ptr MyReal4, xmm0

Buon anno, jj! (https://wasm.in/styles/smiles_s/party.gif)
Intel(R) Core(TM) i33210 CPU @ 3.20GHz (SSE4)
6647 cycles for 100 * fsqrt
5242 cycles for 100 * rsqrtss
5549 cycles for 100 * sqrtss
6665 cycles for 100 * sqrtps
6685 cycles for 100 * fsqrt
5245 cycles for 100 * rsqrtss
5544 cycles for 100 * sqrtss
6653 cycles for 100 * sqrtps
6657 cycles for 100 * fsqrt
5253 cycles for 100 * rsqrtss
5562 cycles for 100 * sqrtss
6653 cycles for 100 * sqrtps
6775 cycles for 100 * fsqrt
5310 cycles for 100 * rsqrtss
5543 cycles for 100 * sqrtss
6654 cycles for 100 * sqrtps
6655 cycles for 100 * fsqrt
5251 cycles for 100 * rsqrtss
5550 cycles for 100 * sqrtss
6647 cycles for 100 * sqrtps
87 bytes for fsqrt
97 bytes for rsqrtss
89 bytes for sqrtss
87 bytes for sqrtps
9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps

Buon anno a te, Mikl :icon14:

Is "oword" a typo, a Johen's invention, or something I'm about to learn? :dazzled:

oword = 128 bits. Or 16 bytes. It's a data identifier like qword, dword, etc.

MASM likes XMMWORD and YMMWORD. I think an earlier version of 32 bit MASM used OWORD.

thanks Felipe and Hutch, this one is new to me!

Ciao, jj2007!
There is soon a new year, so I'll a little flood (https://wasm.in/styles/smiles_s/smile3.gif)
It would seem that it could be simpler than fsqrt, but we'll go the other way
1) The bestknown integer algorithm for calculating the square root of a number is striking in its simplicity
unsigned sqrt_cpu_int(usigned X)
{ unsigned div = 1, result = 0;
while (X > 0)
{ X = div; div += 2; result += X < 0 ? 0 : 1; }
return result;
}
The disadvantage of this algorithm is that the number of iterations will increase with increasing X
2) calculation of the square root of the Taylor series expansion.
Let X be any number; f(X) is a function depending on X; a is a known number close to X; f(a) is a known value of the function.
We expand f(X) in a Taylor series:
f(X)=f(a)+(Xa)f'(a)+((Xa)²∙f"(a))/2!+...+((Xa)^{n}∙f^{n}(a))/n!
Let X be the number from which to extract the root. Then f(X)=√X; a is a known number close to X; f(a)=√a is a known number close to √X, and f(X)=√a+(Xa)/(2√a)+...=(2a+Xa)/(2√a)=(a+X)/(2√a)
The value √X can be found by setting √a and then calculating f(X). f(X)² can be compared to the original number X. If the accuracy is insufficient, then the number a is replaced by f(X)², √a by f(X) and the calculation is repeated
3) the search for an integer square root by Newton's method begins with a certain value of g_{0}, which is the initial estimate √X. Then a series of refinements of the value of the square root is performed by the formula g_{n+1} = (g_{n} + X/g_{n})/2. To reduce the number of iterations, you can, in the first step, select the initial value for the variable g_{0}{usigned x1;
int a, g0, g1;
if (x <= 1) return x;
a = 1;
x1 = x1;
if (x1> 65535) {a = a + 8; x1 = x1 >> 16;}
if (x1> 255) {a = a + 4; x1 = x1 >> 8;}
if (x1> 15) {a = a + 2; x1 = x1 >> 4;}
if (x1> 3) {a = a + 1;}
g0 = 1 << a; // g0 = 2 //
g1 = (g0 + (x >> a)) >> 1;
while (g1 <g0) // we repeat, while the approximations are strictly reduced
{g0 = g1; g1 = (g0 + (x/g0)) >> 1}
}
4) Since the division is a fairly slow operation, you can abandon it. To calculate the square root, we use multiplication. A 32bit X number will have a 16bit square root. At each step, the 1 bit of the root value is refined. For the acceleration, the calculation "without jumps" is made to add or subtract the value in the corresponding digit mov ebx, 4000h
mov ecx, 8000h
mov eax, 40000000h
@@: cmp eax, X
je @f; premature exit from the cycle
sbb edx, edx; if CF = 1 then edx = 0FFFFFFFFh otherwise edx = 0
sub ecx, ebx
and edx, ebx; if CF = 1 then edx = ebx otherwise edx = 0
lea ecx, [ecx + edx * 2]; if CF = 1 then ecx = ecx + ebx otherwise ecx = ecxebx
shr ebx, 1; move to the next level
mov eax, ecx
mul eax; we get "eax squared"
test ebx, ebx
jnz @b
@@: mov eax, ecx; eax: = sqrt (X)
It is not clear how, neither division, nor multiplication, but it works! And the algorithm was described already in 1945 in Von Neumann J. "First Draft of a Reaport on the EDVAC" mov ebp,X
mov ecx,40000000h
xor eax,eax
@@: mov ebx,eax
or ebx,ecx
shr eax,1
cmp ebp,ebx
sbb edx,edx
not edx
mov edi,edx
and edx,ebx
and edi,ecx
sub ebp,edx
or eax,edi
shr ecx,2
test ecx,ecx
jnz @b
;eax=sqrt(X)
Ormov ebp, X
inc ebp
bsr ecx, ebp
and ecx, 2
mov ebx, 1
shl ebx, cl; to reduce the number of iterations
xor eax, eax
@@: lea ecx, [eax + ebx]
shr eax, 1
cmp ecx, ebp
sbb edx, edx
mov edi, edx
and edx, ecx
sub ebp, edx
and edi, ebx
or eax, edi
shr ebx, 2
jnz @b
maybe these algorithms will be faster than fsqrt with 5703 cycles for 100
would be interesting to code it with help of SSE2 integer,think you could call different versions of same algo depending on size,,Think the speed if all variables are byte sized in a 128bit XMM regs

Hi, daydreamer2!
and to you "happy new year!" While there are New Year holidays, I'll think about applying SSE2 in calculating the square root (http://kolobok.us/smiles/standart/drinks.gif)
(http://forumsmile.ru/u/a/5/4/a54424ca622a643b0c8f5b5636e3fcff.png)

Hi, daydreamer2!
and to you "happy new year!" While there are New Year holidays, I'll think about applying SSE2 in calculating the square root (http://kolobok.us/smiles/standart/drinks.gif)
(http://forumsmile.ru/u/a/5/4/a54424ca622a643b0c8f5b5636e3fcff.png)
happy new year Miki :bgrin: