The MASM Forum

General => The Laboratory => Topic started by: jj2007 on December 30, 2017, 09:25:57 PM

Title: fsqrt vs rsqrtss
Post by: jj2007 on December 30, 2017, 09:25:57 PM
Category "almost unknown SIMD instructions" ;)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

5703    cycles for 100 * fsqrt
4380    cycles for 100 * rsqrtss

5684    cycles for 100 * fsqrt
4436    cycles for 100 * rsqrtss

5683    cycles for 100 * fsqrt
4435    cycles for 100 * rsqrtss

5646    cycles for 100 * fsqrt
4420    cycles for 100 * rsqrtss

5699    cycles for 100 * fsqrt
4396    cycles for 100 * rsqrtss

87      bytes for fsqrt
97      bytes for rsqrtss

9483.79199 = fsqrt
9483.57227 = rsqrtss


RSQRTSS--Scalar Single-Precision Floating-Point Square Root Reciprocal
Computes an approximate reciprocal of the square root of the low single-precision floating-point value in the source operand (second operand), stores the single-precision floating-point result in the destination operand. The maximum error for this approximation is (1.5 * 2-12). The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register.
Title: Re: fsqrt vs rsqrtss
Post by: Mikl__ on December 30, 2017, 10:47:53 PM
Ciao, jj2007!
There is soon a new year, so I'll a little flood (https://wasm.in/styles/smiles_s/smile3.gif)
It would seem that it could be simpler than fsqrt, but we'll go the other way
1) The best-known integer algorithm for calculating the square root of a number is striking in its simplicity
unsigned sqrt_cpu_int(usigned X)
{ unsigned div = 1, result = 0;
         while (X > 0)
         { X -= div; div += 2; result += X < 0 ? 0 : 1; }
          return result;
}
The disadvantage of this algorithm is that the number of iterations will increase with increasing X
2) calculation of the square root of the Taylor series expansion.
Let X be any number; f(X) is a function depending on X; a is a known number close to X; f(a) is a known value of the function.
We expand f(X) in a Taylor series:
f(X)=f(a)+(X-a)f'(a)+((X-a)²∙f"(a))/2!+...+((X-a)n∙fn(a))/n!
Let X be the number from which to extract the root. Then f(X)=√X; a is a known number close to X; f(a)=√a is a known number close to √X, and f(X)=√a+(X-a)/(2√a)+...=(2a+X-a)/(2√a)=(a+X)/(2√a)
The value √X can be found by setting √a and then calculating f(X). f(X)² can be compared to the original number X. If the accuracy is insufficient, then the number a is replaced by f(X)², √a by f(X) and the calculation is repeated
3) the search for an integer square root by Newton's method begins with a certain value of g0, which is the initial estimate √X. Then a series of refinements of the value of the square root is performed by the formula gn+1 = (gn + X/gn)/2. To reduce the number of iterations, you can, in the first step, select the initial value for the variable g0{usigned x1;
int a, g0, g1;
if (x <= 1) return x;
a = 1;
x1 = x-1;
if (x1> 65535) {a = a + 8; x1 = x1 >> 16;}
if (x1> 255) {a = a + 4; x1 = x1 >> 8;}
if (x1> 15) {a = a + 2; x1 = x1 >> 4;}
if (x1> 3) {a = a + 1;}
g0 = 1 << a; // g0 = 2 //
g1 = (g0 + (x >> a)) >> 1;
while (g1 <g0) // we repeat, while the approximations are strictly reduced
{g0 = g1; g1 = (g0 + (x/g0)) >> 1}
}
4) Since the division is a fairly slow operation, you can abandon it. To calculate the square root, we use multiplication. A 32-bit X number will have a 16-bit square root. At each step, the 1 bit of the root value is refined. For the acceleration, the calculation "without jumps" is made to add or subtract the value in the corresponding digit mov ebx, 4000h
     mov ecx, 8000h
     mov eax, 40000000h
@@: cmp eax, X
     je @f; premature exit from the cycle
     sbb edx, edx; if CF = 1 then edx = 0FFFFFFFFh otherwise edx = 0
     sub ecx, ebx
     and edx, ebx; if CF = 1 then edx = ebx otherwise edx = 0
     lea ecx, [ecx + edx * 2]; if CF = 1 then ecx = ecx + ebx otherwise ecx = ecx-ebx
     shr ebx, 1; move to the next level
     mov eax, ecx
     mul eax; we get "eax squared"
     test ebx, ebx
     jnz @b
@@: mov eax, ecx; eax: = sqrt (X)
It is not clear how, neither division, nor multiplication, but it works! And the algorithm was described already in 1945 in Von Neumann J. "First Draft of a Reaport on the EDVAC"    mov ebp,X
    mov ecx,40000000h
    xor eax,eax
@@: mov ebx,eax
    or ebx,ecx
    shr eax,1
    cmp ebp,ebx
    sbb edx,edx
    not edx
    mov edi,edx
    and edx,ebx
    and edi,ecx
    sub ebp,edx
    or eax,edi
    shr ecx,2
    test ecx,ecx
    jnz @b
;eax=sqrt(X)
Ormov ebp, X
         inc ebp
     bsr ecx, ebp
     and ecx, -2
     mov ebx, 1
     shl ebx, cl; to reduce the number of iterations
         xor eax, eax
@@: lea ecx, [eax + ebx]
     shr eax, 1
     cmp ecx, ebp
     sbb edx, edx
     mov edi, edx
     and edx, ecx
     sub ebp, edx
     and edi, ebx
     or eax, edi
     shr ebx, 2
     jnz @b
maybe these algorithms will be faster than fsqrt with 5703 cycles for 100
Title: Re: fsqrt vs rsqrtss
Post by: Gunther on December 31, 2017, 12:40:17 AM
Hi Mikl__,

You have set out a very good overview of possible alternative algorithms.  :t Here (https://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi) you will find an interesting article on these questions. This is done in the high level language, but it should not be a problem to realize that in assembly language.

Gunther
Title: Re: fsqrt vs rsqrtss
Post by: jj2007 on December 31, 2017, 12:57:13 AM
If you come up with a convincing MASM algo, I can add it :bgrin:

(I doubt it will be any faster - if it's a good algo, the Intel guys realised it in micro-ops)
Title: Re: fsqrt vs rsqrtss
Post by: Mikl__ on December 31, 2017, 01:11:53 AM
Hallo, Gunther!
Danke für den interessanten Artikel! Ich werde definitiv in Assembler-Sprache übersetzen und die empfangenen Programme auf Geschwindigkeit prüfen. Glückliches neues Jahr!
Title: Re: fsqrt vs rsqrtss
Post by: Gunther on December 31, 2017, 01:35:52 AM
Mikl__,

Желаю вам хорошего здоровья и счастья в следующем году.

Gunther
Title: Re: fsqrt vs rsqrtss
Post by: aw27 on December 31, 2017, 03:06:00 AM
JJ,
Why don't you mention SQRTSS or SQRTPS? Do you prefer reciprocals?  :shock:
Title: Re: fsqrt vs rsqrtss
Post by: jj2007 on December 31, 2017, 04:01:58 AM
Quote from: aw27 on December 31, 2017, 03:06:00 AMWhy don't you mention SQRTSS or SQRTPS? Do you prefer reciprocals?  :shock:

Good idea:Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

5536    cycles for 100 * fsqrt
4300    cycles for 100 * rsqrtss
4638    cycles for 100 * sqrtss
5639    cycles for 100 * sqrtps

5532    cycles for 100 * fsqrt
4301    cycles for 100 * rsqrtss
4638    cycles for 100 * sqrtss
5641    cycles for 100 * sqrtps

5538    cycles for 100 * fsqrt
4301    cycles for 100 * rsqrtss
4640    cycles for 100 * sqrtss
5674    cycles for 100 * sqrtps

5555    cycles for 100 * fsqrt
4304    cycles for 100 * rsqrtss
4640    cycles for 100 * sqrtss
5709    cycles for 100 * sqrtps

5548    cycles for 100 * fsqrt
4305    cycles for 100 * rsqrtss
4641    cycles for 100 * sqrtss
5675    cycles for 100 * sqrtps

87      bytes for fsqrt
97      bytes for rsqrtss
89      bytes for sqrtss
87      bytes for sqrtps

9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps

The last result looks wrong, though ::)

Rand(0.0, 123456789.0, MyReal4)
sqrtps xmm0, oword ptr MyReal4
movaps oword ptr MyReal4, xmm0
Title: Re: fsqrt vs rsqrtss
Post by: Mikl__ on December 31, 2017, 11:42:10 AM
Buon anno, jj! (https://wasm.in/styles/smiles_s/party.gif)
Intel(R) Core(TM) i3-3210 CPU @ 3.20GHz (SSE4)

6647 cycles for 100 * fsqrt
5242 cycles for 100 * rsqrtss
5549 cycles for 100 * sqrtss
6665 cycles for 100 * sqrtps

6685 cycles for 100 * fsqrt
5245 cycles for 100 * rsqrtss
5544 cycles for 100 * sqrtss
6653 cycles for 100 * sqrtps

6657 cycles for 100 * fsqrt
5253 cycles for 100 * rsqrtss
5562 cycles for 100 * sqrtss
6653 cycles for 100 * sqrtps

6775 cycles for 100 * fsqrt
5310 cycles for 100 * rsqrtss
5543 cycles for 100 * sqrtss
6654 cycles for 100 * sqrtps

6655 cycles for 100 * fsqrt
5251 cycles for 100 * rsqrtss
5550 cycles for 100 * sqrtss
6647 cycles for 100 * sqrtps

87 bytes for fsqrt
97 bytes for rsqrtss
89 bytes for sqrtss
87 bytes for sqrtps

9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps
Title: Re: fsqrt vs rsqrtss
Post by: jj2007 on December 31, 2017, 12:06:28 PM
Buon anno a te, Mikl :icon14:
Title: Re: fsqrt vs rsqrtss
Post by: LordAdef on December 31, 2017, 02:32:15 PM
Is "oword" a typo, a Johen's invention, or something I'm about to learn? :dazzled:
Title: Re: fsqrt vs rsqrtss
Post by: felipe on December 31, 2017, 02:44:47 PM
oword = 128 bits. Or 16 bytes. It's a data identifier like qword, dword, etc.
Title: Re: fsqrt vs rsqrtss
Post by: hutch-- on December 31, 2017, 04:11:22 PM
MASM likes XMMWORD and YMMWORD. I think an earlier version of 32 bit MASM used OWORD.
Title: Re: fsqrt vs rsqrtss
Post by: LordAdef on December 31, 2017, 06:59:52 PM
thanks Felipe and Hutch, this one is new to me!
Title: Re: fsqrt vs rsqrtss
Post by: daydreamer on December 31, 2017, 08:15:43 PM
Quote from: Mikl__ on December 30, 2017, 10:47:53 PM
Ciao, jj2007!
There is soon a new year, so I'll a little flood (https://wasm.in/styles/smiles_s/smile3.gif)
It would seem that it could be simpler than fsqrt, but we'll go the other way
1) The best-known integer algorithm for calculating the square root of a number is striking in its simplicity
unsigned sqrt_cpu_int(usigned X)
{ unsigned div = 1, result = 0;
         while (X > 0)
         { X -= div; div += 2; result += X < 0 ? 0 : 1; }
          return result;
}
The disadvantage of this algorithm is that the number of iterations will increase with increasing X
2) calculation of the square root of the Taylor series expansion.
Let X be any number; f(X) is a function depending on X; a is a known number close to X; f(a) is a known value of the function.
We expand f(X) in a Taylor series:
f(X)=f(a)+(X-a)f'(a)+((X-a)²∙f"(a))/2!+...+((X-a)n∙fn(a))/n!
Let X be the number from which to extract the root. Then f(X)=√X; a is a known number close to X; f(a)=√a is a known number close to √X, and f(X)=√a+(X-a)/(2√a)+...=(2a+X-a)/(2√a)=(a+X)/(2√a)
The value √X can be found by setting √a and then calculating f(X). f(X)² can be compared to the original number X. If the accuracy is insufficient, then the number a is replaced by f(X)², √a by f(X) and the calculation is repeated
3) the search for an integer square root by Newton's method begins with a certain value of g0, which is the initial estimate √X. Then a series of refinements of the value of the square root is performed by the formula gn+1 = (gn + X/gn)/2. To reduce the number of iterations, you can, in the first step, select the initial value for the variable g0{usigned x1;
int a, g0, g1;
if (x <= 1) return x;
a = 1;
x1 = x-1;
if (x1> 65535) {a = a + 8; x1 = x1 >> 16;}
if (x1> 255) {a = a + 4; x1 = x1 >> 8;}
if (x1> 15) {a = a + 2; x1 = x1 >> 4;}
if (x1> 3) {a = a + 1;}
g0 = 1 << a; // g0 = 2 //
g1 = (g0 + (x >> a)) >> 1;
while (g1 <g0) // we repeat, while the approximations are strictly reduced
{g0 = g1; g1 = (g0 + (x/g0)) >> 1}
}
4) Since the division is a fairly slow operation, you can abandon it. To calculate the square root, we use multiplication. A 32-bit X number will have a 16-bit square root. At each step, the 1 bit of the root value is refined. For the acceleration, the calculation "without jumps" is made to add or subtract the value in the corresponding digit mov ebx, 4000h
     mov ecx, 8000h
     mov eax, 40000000h
@@: cmp eax, X
     je @f; premature exit from the cycle
     sbb edx, edx; if CF = 1 then edx = 0FFFFFFFFh otherwise edx = 0
     sub ecx, ebx
     and edx, ebx; if CF = 1 then edx = ebx otherwise edx = 0
     lea ecx, [ecx + edx * 2]; if CF = 1 then ecx = ecx + ebx otherwise ecx = ecx-ebx
     shr ebx, 1; move to the next level
     mov eax, ecx
     mul eax; we get "eax squared"
     test ebx, ebx
     jnz @b
@@: mov eax, ecx; eax: = sqrt (X)
It is not clear how, neither division, nor multiplication, but it works! And the algorithm was described already in 1945 in Von Neumann J. "First Draft of a Reaport on the EDVAC"    mov ebp,X
    mov ecx,40000000h
    xor eax,eax
@@: mov ebx,eax
    or ebx,ecx
    shr eax,1
    cmp ebp,ebx
    sbb edx,edx
    not edx
    mov edi,edx
    and edx,ebx
    and edi,ecx
    sub ebp,edx
    or eax,edi
    shr ecx,2
    test ecx,ecx
    jnz @b
;eax=sqrt(X)
Ormov ebp, X
         inc ebp
     bsr ecx, ebp
     and ecx, -2
     mov ebx, 1
     shl ebx, cl; to reduce the number of iterations
         xor eax, eax
@@: lea ecx, [eax + ebx]
     shr eax, 1
     cmp ecx, ebp
     sbb edx, edx
     mov edi, edx
     and edx, ecx
     sub ebp, edx
     and edi, ebx
     or eax, edi
     shr ebx, 2
     jnz @b
maybe these algorithms will be faster than fsqrt with 5703 cycles for 100
would be interesting to code it with help of SSE2 integer,think you could call different versions of same algo depending on size,,Think the speed if all variables are byte sized in a 128bit XMM regs
Title: Re: fsqrt vs rsqrtss
Post by: Mikl__ on December 31, 2017, 09:45:58 PM
Hi, daydreamer2!
and to you "happy new year!" While there are New Year holidays, I'll think about applying SSE2 in calculating the square root (http://kolobok.us/smiles/standart/drinks.gif)
(http://forumsmile.ru/u/a/5/4/a54424ca622a643b0c8f5b5636e3fcff.png)
Title: Re: fsqrt vs rsqrtss
Post by: daydreamer on January 01, 2018, 01:02:29 AM
Quote from: Mikl__ on December 31, 2017, 09:45:58 PM
Hi, daydreamer2!
and to you "happy new year!" While there are New Year holidays, I'll think about applying SSE2 in calculating the square root (http://kolobok.us/smiles/standart/drinks.gif)
(http://forumsmile.ru/u/a/5/4/a54424ca622a643b0c8f5b5636e3fcff.png)
happy new year Miki :bgrin:
Title: Re: fsqrt vs rsqrtss
Post by: daydreamer on January 03, 2025, 12:32:00 AM
hi
want this timings as research for prime test optimisation
starting with my cpu
also got reminded by Zedd that some members has got newer computers after this test
also interested in div,fdiv,divss,divps,divsd,divpd timings
also interested in mul,fmul,mulss,mulps,mulsd,mulpd timings

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

5432    cycles for 100 * fsqrt
4205    cycles for 100 * rsqrtss
4559    cycles for 100 * sqrtss
5435    cycles for 100 * sqrtps

5392    cycles for 100 * fsqrt
4204    cycles for 100 * rsqrtss
4553    cycles for 100 * sqrtss
5450    cycles for 100 * sqrtps

5385    cycles for 100 * fsqrt
4215    cycles for 100 * rsqrtss
4551    cycles for 100 * sqrtss
5435    cycles for 100 * sqrtps

5386    cycles for 100 * fsqrt
4198    cycles for 100 * rsqrtss
4557    cycles for 100 * sqrtss
5451    cycles for 100 * sqrtps

5407    cycles for 100 * fsqrt
4203    cycles for 100 * rsqrtss
4553    cycles for 100 * sqrtss
5449    cycles for 100 * sqrtps

87      bytes for fsqrt
97      bytes for rsqrtss
89      bytes for sqrtss
87      bytes for sqrtps

9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps

-
Title: Re: fsqrt vs rsqrtss
Post by: zedd151 on January 03, 2025, 03:23:11 AM
Per Magnus's suggestion...

Dell Optiplex 7050 SFF PC:
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (SSE4)

5367    cycles for 100 * fsqrt
4180    cycles for 100 * rsqrtss
4563    cycles for 100 * sqrtss
5435    cycles for 100 * sqrtps

5364    cycles for 100 * fsqrt
4173    cycles for 100 * rsqrtss
4527    cycles for 100 * sqrtss
5384    cycles for 100 * sqrtps

5366    cycles for 100 * fsqrt
4178    cycles for 100 * rsqrtss
4543    cycles for 100 * sqrtss
5475    cycles for 100 * sqrtps

5464    cycles for 100 * fsqrt
4241    cycles for 100 * rsqrtss
4569    cycles for 100 * sqrtss
5499    cycles for 100 * sqrtps

5410    cycles for 100 * fsqrt
4200    cycles for 100 * rsqrtss
4561    cycles for 100 * sqrtss
5424    cycles for 100 * sqrtps

87      bytes for fsqrt
97      bytes for rsqrtss
89      bytes for sqrtss
87      bytes for sqrtps

9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps

OTVOC 15 laptop:
Intel(R) Celeron(R) N5105 @ 2.00GHz (SSE4)

4494    cycles for 100 * fsqrt
3734    cycles for 100 * rsqrtss
3788    cycles for 100 * sqrtss
4906    cycles for 100 * sqrtps

4481    cycles for 100 * fsqrt
3709    cycles for 100 * rsqrtss
3835    cycles for 100 * sqrtss
4708    cycles for 100 * sqrtps

4462    cycles for 100 * fsqrt
3575    cycles for 100 * rsqrtss
3904    cycles for 100 * sqrtss
4838    cycles for 100 * sqrtps

4536    cycles for 100 * fsqrt
3637    cycles for 100 * rsqrtss
3879    cycles for 100 * sqrtss
4673    cycles for 100 * sqrtps

4409    cycles for 100 * fsqrt
3513    cycles for 100 * rsqrtss
3748    cycles for 100 * sqrtss
4649    cycles for 100 * sqrtps

87      bytes for fsqrt
97      bytes for rsqrtss
89      bytes for sqrtss
87      bytes for sqrtps

9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps

:cool:   Not too shabby, eh?
Title: Re: fsqrt vs rsqrtss
Post by: daydreamer on January 03, 2025, 09:00:59 PM
Thanks,looks good zedd
Title: Re: fsqrt vs rsqrtss
Post by: daydreamer on January 11, 2025, 04:41:50 AM
fastest sqrt ever  :biggrin:
Title: Re: fsqrt vs rsqrtss
Post by: zedd151 on January 11, 2025, 05:40:45 AM
Quote from: daydreamer on January 11, 2025, 04:41:50 AMfastest sqrt ever  :biggrin:

That is a bold statement, Magnus. That sounds like a challenge. :biggrin:
We'll see... I'll look at that later. :cool:


A few short minutes later...
"shape2.com"  :rolleyes:

Okay, lemme get dosbox...
(https://i.postimg.cc/9Mq4yZMH/SHAPE2.png)

Is that what it is supposed to look like? (its the final screen)
Title: Re: fsqrt vs rsqrtss
Post by: daydreamer on January 26, 2025, 08:05:51 PM
Its very fast drawing 4 tiles rotated  90 degrees
Inspired by pixelshaders pixel drawing things with formula per pixel
After that developed straight road,sinus road,45 degree road and scroll thru all 640kb
It was beginning of develop demo that drawed a racetrack or river

Sqrt For purpose of limit prime division test,a sqrt integer proc also has the bonus of detecting some non primes :
Sqrt(4),sqrt(16),sqrt(25),sqrt(36),sqrt(100) ... 0 fraction = can skip division loop entirely

Sqrt only for integer in prime test loop :
;high byte lut only approximate sqrt
;low byte lut sqrt nearest integer

If eax>256
Mov al,ah
Xlat high byte lut
Else
Xlat low byte out
End if

I bet this could be fast



Title: Re: fsqrt vs rsqrtss
Post by: InfiniteLoop on February 01, 2025, 09:30:27 PM
Which is better?
The 1st uses fewer bytes, the 2nd uses fewer cycles, the 3rd is 100% accurate.
That's enough effort spent on square roots for now.

vcvtusi2ss xmm0,xmm0,rcx    ;unsigned long long to float
vrsqrt14ss xmm1,xmm0,xmm0
vmulss xmm3,xmm1,xmm0
vfmsub231ss xmm0,xmm3,xmm3
vaddss xmm2,xmm3,xmm3
vrcpss xmm2,xmm2,xmm2
vfnmadd231ss xmm3,xmm0,xmm2
vcvttss2usi rax,xmm3

vs

vcvtusi2ss xmm0,xmm0,rcx      ;unsigned long long to float
vrsqrt14ss xmm1,xmm0,xmm0
vmulss xmm3,xmm1,xmm0
vmulss xmm2,xmm1,real4 ptr [F_HALF]   ;1/2 * rsqrt(x)
vfmsub231ss xmm0,xmm3,xmm3
vfnmadd231ss xmm3,xmm0,xmm2
vcvttss2usi rax,xmm3

vs

vcvtusi2sd xmm0,xmm0,rcx
vsqrtpd xmm1,xmm0         ;18/12 cycles. expensive function.
vcvttsd2usi rax,xmm1
Title: Re: fsqrt vs rsqrtss
Post by: daydreamer on February 02, 2025, 01:27:21 AM
Thanks infiniteloop
What is best depends on purpose
Fastest enough good for restrict divisions in prime test loop
and sqrps good for 2d round light
Title: Re: fsqrt vs rsqrtss
Post by: jj2007 on February 02, 2025, 08:42:40 AM
Quote from: InfiniteLoop on February 01, 2025, 09:30:27 PMvcvtusi2ss xmm0,xmm0,rcx    ;unsigned long long to float

Crashes with "illegal instruction" on my relatively new cpu
Title: Re: fsqrt vs rsqrtss
Post by: jack on February 02, 2025, 11:21:17 AM
apparently the unsigned instructions vcvtusi2ss and vcvttsd2usi are AVX512 instructions