Category "almost unknown SIMD instructions" ;)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
5703 cycles for 100 * fsqrt
4380 cycles for 100 * rsqrtss
5684 cycles for 100 * fsqrt
4436 cycles for 100 * rsqrtss
5683 cycles for 100 * fsqrt
4435 cycles for 100 * rsqrtss
5646 cycles for 100 * fsqrt
4420 cycles for 100 * rsqrtss
5699 cycles for 100 * fsqrt
4396 cycles for 100 * rsqrtss
87 bytes for fsqrt
97 bytes for rsqrtss
9483.79199 = fsqrt
9483.57227 = rsqrtss
RSQRTSS--Scalar Single-Precision Floating-Point Square Root Reciprocal
Computes an approximate reciprocal of the square root of the low single-precision floating-point value in the source operand (second operand), stores the single-precision floating-point result in the destination operand. The maximum error for this approximation is (1.5 * 2-12). The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register.
Ciao, jj2007!
There is soon a new year, so I'll a little flood (https://wasm.in/styles/smiles_s/smile3.gif)
It would seem that it could be simpler than fsqrt, but we'll go the other way
1) The best-known integer algorithm for calculating the square root of a number is striking in its simplicity
unsigned sqrt_cpu_int(usigned X)
{ unsigned div = 1, result = 0;
while (X > 0)
{ X -= div; div += 2; result += X < 0 ? 0 : 1; }
return result;
}
The disadvantage of this algorithm is that the number of iterations will increase with increasing X
2) calculation of the square root of the Taylor series expansion.
Let X be any number; f(X) is a function depending on X; a is a known number close to X; f(a) is a known value of the function.
We expand f(X) in a Taylor series:
f(X)=f(a)+(X-a)f'(a)+((X-a)²∙f"(a))/2!+...+((X-a)n∙fn(a))/n!
Let X be the number from which to extract the root. Then f(X)=√X; a is a known number close to X; f(a)=√a is a known number close to √X, and f(X)=√a+(X-a)/(2√a)+...=(2a+X-a)/(2√a)=(a+X)/(2√a)
The value √X can be found by setting √a and then calculating f(X). f(X)² can be compared to the original number X. If the accuracy is insufficient, then the number a is replaced by f(X)², √a by f(X) and the calculation is repeated
3) the search for an integer square root by Newton's method begins with a certain value of g0, which is the initial estimate √X. Then a series of refinements of the value of the square root is performed by the formula gn+1 = (gn + X/gn)/2. To reduce the number of iterations, you can, in the first step, select the initial value for the variable g0{usigned x1;
int a, g0, g1;
if (x <= 1) return x;
a = 1;
x1 = x-1;
if (x1> 65535) {a = a + 8; x1 = x1 >> 16;}
if (x1> 255) {a = a + 4; x1 = x1 >> 8;}
if (x1> 15) {a = a + 2; x1 = x1 >> 4;}
if (x1> 3) {a = a + 1;}
g0 = 1 << a; // g0 = 2 //
g1 = (g0 + (x >> a)) >> 1;
while (g1 <g0) // we repeat, while the approximations are strictly reduced
{g0 = g1; g1 = (g0 + (x/g0)) >> 1}
}
4) Since the division is a fairly slow operation, you can abandon it. To calculate the square root, we use multiplication. A 32-bit X number will have a 16-bit square root. At each step, the 1 bit of the root value is refined. For the acceleration, the calculation "without jumps" is made to add or subtract the value in the corresponding digit mov ebx, 4000h
mov ecx, 8000h
mov eax, 40000000h
@@: cmp eax, X
je @f; premature exit from the cycle
sbb edx, edx; if CF = 1 then edx = 0FFFFFFFFh otherwise edx = 0
sub ecx, ebx
and edx, ebx; if CF = 1 then edx = ebx otherwise edx = 0
lea ecx, [ecx + edx * 2]; if CF = 1 then ecx = ecx + ebx otherwise ecx = ecx-ebx
shr ebx, 1; move to the next level
mov eax, ecx
mul eax; we get "eax squared"
test ebx, ebx
jnz @b
@@: mov eax, ecx; eax: = sqrt (X)
It is not clear how, neither division, nor multiplication, but it works! And the algorithm was described already in 1945 in Von Neumann J. "First Draft of a Reaport on the EDVAC" mov ebp,X
mov ecx,40000000h
xor eax,eax
@@: mov ebx,eax
or ebx,ecx
shr eax,1
cmp ebp,ebx
sbb edx,edx
not edx
mov edi,edx
and edx,ebx
and edi,ecx
sub ebp,edx
or eax,edi
shr ecx,2
test ecx,ecx
jnz @b
;eax=sqrt(X)
Ormov ebp, X
inc ebp
bsr ecx, ebp
and ecx, -2
mov ebx, 1
shl ebx, cl; to reduce the number of iterations
xor eax, eax
@@: lea ecx, [eax + ebx]
shr eax, 1
cmp ecx, ebp
sbb edx, edx
mov edi, edx
and edx, ecx
sub ebp, edx
and edi, ebx
or eax, edi
shr ebx, 2
jnz @b
maybe these algorithms will be faster than fsqrt with 5703 cycles for 100
Hi Mikl__,
You have set out a very good overview of possible alternative algorithms. :t Here (https://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi) you will find an interesting article on these questions. This is done in the high level language, but it should not be a problem to realize that in assembly language.
Gunther
If you come up with a convincing MASM algo, I can add it :bgrin:
(I doubt it will be any faster - if it's a good algo, the Intel guys realised it in micro-ops)
Hallo, Gunther!
Danke für den interessanten Artikel! Ich werde definitiv in Assembler-Sprache übersetzen und die empfangenen Programme auf Geschwindigkeit prüfen. Glückliches neues Jahr!
Mikl__,
Желаю вам хорошего здоровья и счастья в следующем году.
Gunther
JJ,
Why don't you mention SQRTSS or SQRTPS? Do you prefer reciprocals? :shock:
Quote from: aw27 on December 31, 2017, 03:06:00 AMWhy don't you mention SQRTSS or SQRTPS? Do you prefer reciprocals? :shock:
Good idea:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
5536 cycles for 100 * fsqrt
4300 cycles for 100 * rsqrtss
4638 cycles for 100 * sqrtss
5639 cycles for 100 * sqrtps
5532 cycles for 100 * fsqrt
4301 cycles for 100 * rsqrtss
4638 cycles for 100 * sqrtss
5641 cycles for 100 * sqrtps
5538 cycles for 100 * fsqrt
4301 cycles for 100 * rsqrtss
4640 cycles for 100 * sqrtss
5674 cycles for 100 * sqrtps
5555 cycles for 100 * fsqrt
4304 cycles for 100 * rsqrtss
4640 cycles for 100 * sqrtss
5709 cycles for 100 * sqrtps
5548 cycles for 100 * fsqrt
4305 cycles for 100 * rsqrtss
4641 cycles for 100 * sqrtss
5675 cycles for 100 * sqrtps
87 bytes for fsqrt
97 bytes for rsqrtss
89 bytes for sqrtss
87 bytes for sqrtps
9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps
The last result looks wrong, though ::)
Rand(0.0, 123456789.0, MyReal4)
sqrtps xmm0, oword ptr MyReal4
movaps oword ptr MyReal4, xmm0
Buon anno, jj! (https://wasm.in/styles/smiles_s/party.gif)
Intel(R) Core(TM) i3-3210 CPU @ 3.20GHz (SSE4)
6647 cycles for 100 * fsqrt
5242 cycles for 100 * rsqrtss
5549 cycles for 100 * sqrtss
6665 cycles for 100 * sqrtps
6685 cycles for 100 * fsqrt
5245 cycles for 100 * rsqrtss
5544 cycles for 100 * sqrtss
6653 cycles for 100 * sqrtps
6657 cycles for 100 * fsqrt
5253 cycles for 100 * rsqrtss
5562 cycles for 100 * sqrtss
6653 cycles for 100 * sqrtps
6775 cycles for 100 * fsqrt
5310 cycles for 100 * rsqrtss
5543 cycles for 100 * sqrtss
6654 cycles for 100 * sqrtps
6655 cycles for 100 * fsqrt
5251 cycles for 100 * rsqrtss
5550 cycles for 100 * sqrtss
6647 cycles for 100 * sqrtps
87 bytes for fsqrt
97 bytes for rsqrtss
89 bytes for sqrtss
87 bytes for sqrtps
9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps
Buon anno a te, Mikl :icon14:
Is "oword" a typo, a Johen's invention, or something I'm about to learn? :dazzled:
oword = 128 bits. Or 16 bytes. It's a data identifier like qword, dword, etc.
MASM likes XMMWORD and YMMWORD. I think an earlier version of 32 bit MASM used OWORD.
thanks Felipe and Hutch, this one is new to me!
Quote from: Mikl__ on December 30, 2017, 10:47:53 PM
Ciao, jj2007!
There is soon a new year, so I'll a little flood (https://wasm.in/styles/smiles_s/smile3.gif)
It would seem that it could be simpler than fsqrt, but we'll go the other way
1) The best-known integer algorithm for calculating the square root of a number is striking in its simplicity
unsigned sqrt_cpu_int(usigned X)
{ unsigned div = 1, result = 0;
while (X > 0)
{ X -= div; div += 2; result += X < 0 ? 0 : 1; }
return result;
}
The disadvantage of this algorithm is that the number of iterations will increase with increasing X
2) calculation of the square root of the Taylor series expansion.
Let X be any number; f(X) is a function depending on X; a is a known number close to X; f(a) is a known value of the function.
We expand f(X) in a Taylor series:
f(X)=f(a)+(X-a)f'(a)+((X-a)²∙f"(a))/2!+...+((X-a)n∙fn(a))/n!
Let X be the number from which to extract the root. Then f(X)=√X; a is a known number close to X; f(a)=√a is a known number close to √X, and f(X)=√a+(X-a)/(2√a)+...=(2a+X-a)/(2√a)=(a+X)/(2√a)
The value √X can be found by setting √a and then calculating f(X). f(X)² can be compared to the original number X. If the accuracy is insufficient, then the number a is replaced by f(X)², √a by f(X) and the calculation is repeated
3) the search for an integer square root by Newton's method begins with a certain value of g0, which is the initial estimate √X. Then a series of refinements of the value of the square root is performed by the formula gn+1 = (gn + X/gn)/2. To reduce the number of iterations, you can, in the first step, select the initial value for the variable g0{usigned x1;
int a, g0, g1;
if (x <= 1) return x;
a = 1;
x1 = x-1;
if (x1> 65535) {a = a + 8; x1 = x1 >> 16;}
if (x1> 255) {a = a + 4; x1 = x1 >> 8;}
if (x1> 15) {a = a + 2; x1 = x1 >> 4;}
if (x1> 3) {a = a + 1;}
g0 = 1 << a; // g0 = 2 //
g1 = (g0 + (x >> a)) >> 1;
while (g1 <g0) // we repeat, while the approximations are strictly reduced
{g0 = g1; g1 = (g0 + (x/g0)) >> 1}
}
4) Since the division is a fairly slow operation, you can abandon it. To calculate the square root, we use multiplication. A 32-bit X number will have a 16-bit square root. At each step, the 1 bit of the root value is refined. For the acceleration, the calculation "without jumps" is made to add or subtract the value in the corresponding digit mov ebx, 4000h
mov ecx, 8000h
mov eax, 40000000h
@@: cmp eax, X
je @f; premature exit from the cycle
sbb edx, edx; if CF = 1 then edx = 0FFFFFFFFh otherwise edx = 0
sub ecx, ebx
and edx, ebx; if CF = 1 then edx = ebx otherwise edx = 0
lea ecx, [ecx + edx * 2]; if CF = 1 then ecx = ecx + ebx otherwise ecx = ecx-ebx
shr ebx, 1; move to the next level
mov eax, ecx
mul eax; we get "eax squared"
test ebx, ebx
jnz @b
@@: mov eax, ecx; eax: = sqrt (X)
It is not clear how, neither division, nor multiplication, but it works! And the algorithm was described already in 1945 in Von Neumann J. "First Draft of a Reaport on the EDVAC" mov ebp,X
mov ecx,40000000h
xor eax,eax
@@: mov ebx,eax
or ebx,ecx
shr eax,1
cmp ebp,ebx
sbb edx,edx
not edx
mov edi,edx
and edx,ebx
and edi,ecx
sub ebp,edx
or eax,edi
shr ecx,2
test ecx,ecx
jnz @b
;eax=sqrt(X)
Ormov ebp, X
inc ebp
bsr ecx, ebp
and ecx, -2
mov ebx, 1
shl ebx, cl; to reduce the number of iterations
xor eax, eax
@@: lea ecx, [eax + ebx]
shr eax, 1
cmp ecx, ebp
sbb edx, edx
mov edi, edx
and edx, ecx
sub ebp, edx
and edi, ebx
or eax, edi
shr ebx, 2
jnz @b
maybe these algorithms will be faster than fsqrt with 5703 cycles for 100
would be interesting to code it with help of SSE2 integer,think you could call different versions of same algo depending on size,,Think the speed if all variables are byte sized in a 128bit XMM regs
Hi, daydreamer2!
and to you "happy new year!" While there are New Year holidays, I'll think about applying SSE2 in calculating the square root (http://kolobok.us/smiles/standart/drinks.gif)
(http://forumsmile.ru/u/a/5/4/a54424ca622a643b0c8f5b5636e3fcff.png)
Quote from: Mikl__ on December 31, 2017, 09:45:58 PM
Hi, daydreamer2!
and to you "happy new year!" While there are New Year holidays, I'll think about applying SSE2 in calculating the square root (http://kolobok.us/smiles/standart/drinks.gif)
(http://forumsmile.ru/u/a/5/4/a54424ca622a643b0c8f5b5636e3fcff.png)
happy new year Miki :bgrin:
hi
want this timings as research for prime test optimisation
starting with my cpu
also got reminded by Zedd that some members has got newer computers after this test
also interested in div,fdiv,divss,divps,divsd,divpd timings
also interested in mul,fmul,mulss,mulps,mulsd,mulpd timings
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
5432 cycles for 100 * fsqrt
4205 cycles for 100 * rsqrtss
4559 cycles for 100 * sqrtss
5435 cycles for 100 * sqrtps
5392 cycles for 100 * fsqrt
4204 cycles for 100 * rsqrtss
4553 cycles for 100 * sqrtss
5450 cycles for 100 * sqrtps
5385 cycles for 100 * fsqrt
4215 cycles for 100 * rsqrtss
4551 cycles for 100 * sqrtss
5435 cycles for 100 * sqrtps
5386 cycles for 100 * fsqrt
4198 cycles for 100 * rsqrtss
4557 cycles for 100 * sqrtss
5451 cycles for 100 * sqrtps
5407 cycles for 100 * fsqrt
4203 cycles for 100 * rsqrtss
4553 cycles for 100 * sqrtss
5449 cycles for 100 * sqrtps
87 bytes for fsqrt
97 bytes for rsqrtss
89 bytes for sqrtss
87 bytes for sqrtps
9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps
-
Per Magnus's suggestion...
Dell Optiplex 7050 SFF PC:
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (SSE4)
5367 cycles for 100 * fsqrt
4180 cycles for 100 * rsqrtss
4563 cycles for 100 * sqrtss
5435 cycles for 100 * sqrtps
5364 cycles for 100 * fsqrt
4173 cycles for 100 * rsqrtss
4527 cycles for 100 * sqrtss
5384 cycles for 100 * sqrtps
5366 cycles for 100 * fsqrt
4178 cycles for 100 * rsqrtss
4543 cycles for 100 * sqrtss
5475 cycles for 100 * sqrtps
5464 cycles for 100 * fsqrt
4241 cycles for 100 * rsqrtss
4569 cycles for 100 * sqrtss
5499 cycles for 100 * sqrtps
5410 cycles for 100 * fsqrt
4200 cycles for 100 * rsqrtss
4561 cycles for 100 * sqrtss
5424 cycles for 100 * sqrtps
87 bytes for fsqrt
97 bytes for rsqrtss
89 bytes for sqrtss
87 bytes for sqrtps
9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps
OTVOC 15 laptop:
Intel(R) Celeron(R) N5105 @ 2.00GHz (SSE4)
4494 cycles for 100 * fsqrt
3734 cycles for 100 * rsqrtss
3788 cycles for 100 * sqrtss
4906 cycles for 100 * sqrtps
4481 cycles for 100 * fsqrt
3709 cycles for 100 * rsqrtss
3835 cycles for 100 * sqrtss
4708 cycles for 100 * sqrtps
4462 cycles for 100 * fsqrt
3575 cycles for 100 * rsqrtss
3904 cycles for 100 * sqrtss
4838 cycles for 100 * sqrtps
4536 cycles for 100 * fsqrt
3637 cycles for 100 * rsqrtss
3879 cycles for 100 * sqrtss
4673 cycles for 100 * sqrtps
4409 cycles for 100 * fsqrt
3513 cycles for 100 * rsqrtss
3748 cycles for 100 * sqrtss
4649 cycles for 100 * sqrtps
87 bytes for fsqrt
97 bytes for rsqrtss
89 bytes for sqrtss
87 bytes for sqrtps
9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps
:cool: Not too shabby, eh?
Thanks,looks good zedd
fastest sqrt ever :biggrin:
Quote from: daydreamer on January 11, 2025, 04:41:50 AMfastest sqrt ever :biggrin:
That is a bold statement, Magnus. That sounds like a challenge. :biggrin:
We'll see... I'll look at that later. :cool:
A few short minutes later...
"shape2.com" :rolleyes:
Okay, lemme get dosbox...
(https://i.postimg.cc/9Mq4yZMH/SHAPE2.png)
Is that what it is supposed to look like? (its the final screen)
Its very fast drawing 4 tiles rotated 90 degrees
Inspired by pixelshaders pixel drawing things with formula per pixel
After that developed straight road,sinus road,45 degree road and scroll thru all 640kb
It was beginning of develop demo that drawed a racetrack or river
Sqrt For purpose of limit prime division test,a sqrt integer proc also has the bonus of detecting some non primes :
Sqrt(4),sqrt(16),sqrt(25),sqrt(36),sqrt(100) ... 0 fraction = can skip division loop entirely
Sqrt only for integer in prime test loop :
;high byte lut only approximate sqrt
;low byte lut sqrt nearest integer
If eax>256
Mov al,ah
Xlat high byte lut
Else
Xlat low byte out
End if
I bet this could be fast
Which is better?
The 1st uses fewer bytes, the 2nd uses fewer cycles, the 3rd is 100% accurate.
That's enough effort spent on square roots for now.
vcvtusi2ss xmm0,xmm0,rcx ;unsigned long long to float
vrsqrt14ss xmm1,xmm0,xmm0
vmulss xmm3,xmm1,xmm0
vfmsub231ss xmm0,xmm3,xmm3
vaddss xmm2,xmm3,xmm3
vrcpss xmm2,xmm2,xmm2
vfnmadd231ss xmm3,xmm0,xmm2
vcvttss2usi rax,xmm3
vs
vcvtusi2ss xmm0,xmm0,rcx ;unsigned long long to float
vrsqrt14ss xmm1,xmm0,xmm0
vmulss xmm3,xmm1,xmm0
vmulss xmm2,xmm1,real4 ptr [F_HALF] ;1/2 * rsqrt(x)
vfmsub231ss xmm0,xmm3,xmm3
vfnmadd231ss xmm3,xmm0,xmm2
vcvttss2usi rax,xmm3
vs
vcvtusi2sd xmm0,xmm0,rcx
vsqrtpd xmm1,xmm0 ;18/12 cycles. expensive function.
vcvttsd2usi rax,xmm1
Thanks infiniteloop
What is best depends on purpose
Fastest enough good for restrict divisions in prime test loop
and sqrps good for 2d round light
Quote from: InfiniteLoop on February 01, 2025, 09:30:27 PMvcvtusi2ss xmm0,xmm0,rcx ;unsigned long long to float
Crashes with "illegal instruction" on my relatively new cpu
apparently the unsigned instructions vcvtusi2ss and vcvttsd2usi are AVX512 instructions