### Author Topic: fsqrt vs rsqrtss  (Read 2776 times)

#### jj2007

• Member
•     • • Posts: 9401
• Assembler is fun ;-) ##### fsqrt vs rsqrtss
« on: December 30, 2017, 09:25:57 PM »
Category "almost unknown SIMD instructions" ;)
Code: [Select]
`Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)5703    cycles for 100 * fsqrt4380    cycles for 100 * rsqrtss5684    cycles for 100 * fsqrt4436    cycles for 100 * rsqrtss5683    cycles for 100 * fsqrt4435    cycles for 100 * rsqrtss5646    cycles for 100 * fsqrt4420    cycles for 100 * rsqrtss5699    cycles for 100 * fsqrt4396    cycles for 100 * rsqrtss87      bytes for fsqrt97      bytes for rsqrtss9483.79199 = fsqrt9483.57227 = rsqrtss`
RSQRTSS--Scalar Single-Precision Floating-Point Square Root Reciprocal
Computes an approximate reciprocal of the square root of the low single-precision floating-point value in the source operand (second operand), stores the single-precision floating-point result in the destination operand. The maximum error for this approximation is (1.5 * 2-12). The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register.

#### Mikl__

• Member
•    • Posts: 743 ##### Re: fsqrt vs rsqrtss
« Reply #1 on: December 30, 2017, 10:47:53 PM »
Ciao, jj2007!
There is soon a new year, so I'll a little flood It would seem that it could be simpler than fsqrt, but we'll go the other way
1) The best-known integer algorithm for calculating the square root of a number is striking in its simplicity
Code: [Select]
`unsigned sqrt_cpu_int(usigned X){ unsigned div = 1, result = 0;          while (X > 0)         { X -= div; div += 2; result += X < 0 ? 0 : 1; }          return result; }`The disadvantage of this algorithm is that the number of iterations will increase with increasing X
2) calculation of the square root of the Taylor series expansion.
Let X be any number; f(X) is a function depending on X; a is a known number close to X; f(a) is a known value of the function.
We expand f(X) in a Taylor series:
f(X)=f(a)+(X-a)f'(a)+((X-a)²∙f"(a))/2!+...+((X-a)n∙fn(a))/n!
Let X be the number from which to extract the root. Then f(X)=√X; a is a known number close to X; f(a)=√a is a known number close to √X, and f(X)=√a+(X-a)/(2√a)+...=(2a+X-a)/(2√a)=(a+X)/(2√a)
The value √X can be found by setting √a and then calculating f(X). f(X)² can be compared to the original number X. If the accuracy is insufficient, then the number a is replaced by f(X)², √a by f(X) and the calculation is repeated
3) the search for an integer square root by Newton's method begins with a certain value of g0, which is the initial estimate √X. Then a series of refinements of the value of the square root is performed by the formula gn+1 = (gn + X/gn)/2. To reduce the number of iterations, you can, in the first step, select the initial value for the variable g0
Code: [Select]
`{usigned x1;int a, g0, g1;if (x <= 1) return x;a = 1;x1 = x-1;if (x1> 65535) {a = a + 8; x1 = x1 >> 16;}if (x1> 255) {a = a + 4; x1 = x1 >> 8;}if (x1> 15) {a = a + 2; x1 = x1 >> 4;}if (x1> 3) {a = a + 1;}g0 = 1 << a; // g0 = 2 //g1 = (g0 + (x >> a)) >> 1;while (g1 <g0) // we repeat, while the approximations are strictly reduced{g0 = g1; g1 = (g0 + (x/g0)) >> 1}}`4) Since the division is a fairly slow operation, you can abandon it. To calculate the square root, we use multiplication. A 32-bit X number will have a 16-bit square root. At each step, the 1 bit of the root value is refined. For the acceleration, the calculation "without jumps" is made to add or subtract the value in the corresponding digit
Code: [Select]
`mov ebx, 4000h     mov ecx, 8000h     mov eax, 40000000h@@: cmp eax, X     je @f; premature exit from the cycle     sbb edx, edx; if CF = 1 then edx = 0FFFFFFFFh otherwise edx = 0     sub ecx, ebx     and edx, ebx; if CF = 1 then edx = ebx otherwise edx = 0     lea ecx, [ecx + edx * 2]; if CF = 1 then ecx = ecx + ebx otherwise ecx = ecx-ebx     shr ebx, 1; move to the next level     mov eax, ecx     mul eax; we get "eax squared"     test ebx, ebx     jnz @b@@: mov eax, ecx; eax: = sqrt (X)`It is not clear how, neither division, nor multiplication, but it works! And the algorithm was described already in 1945 in Von Neumann J. "First Draft of a Reaport on the EDVAC"
Code: [Select]
`    mov ebp,X    mov ecx,40000000h    xor eax,eax@@: mov ebx,eax    or ebx,ecx    shr eax,1    cmp ebp,ebx    sbb edx,edx    not edx    mov edi,edx    and edx,ebx    and edi,ecx    sub ebp,edx    or eax,edi    shr ecx,2    test ecx,ecx    jnz @b;eax=sqrt(X)`Or
Code: [Select]
`mov ebp, X         inc ebp     bsr ecx, ebp     and ecx, -2     mov ebx, 1     shl ebx, cl; to reduce the number of iterations         xor eax, eax@@: lea ecx, [eax + ebx]     shr eax, 1     cmp ecx, ebp     sbb edx, edx     mov edi, edx     and edx, ecx     sub ebp, edx     and edi, ebx     or eax, edi     shr ebx, 2     jnz @b`maybe these algorithms will be faster than fsqrt with 5703 cycles for 100
« Last Edit: December 31, 2017, 11:35:34 AM by Mikl__ »

#### Gunther

• Member
•     • • Posts: 3585
• Forgive your enemies, but never forget their names ##### Re: fsqrt vs rsqrtss
« Reply #2 on: December 31, 2017, 12:40:17 AM »
Hi Mikl__,

You have set out a very good overview of possible alternative algorithms.  :t Here you will find an interesting article on these questions. This is done in the high level language, but it should not be a problem to realize that in assembly language.

Gunther
Get your facts first, and then you can distort them.

#### jj2007

• Member
•     • • Posts: 9401
• Assembler is fun ;-) ##### Re: fsqrt vs rsqrtss
« Reply #3 on: December 31, 2017, 12:57:13 AM »
If you come up with a convincing MASM algo, I can add it (I doubt it will be any faster - if it's a good algo, the Intel guys realised it in micro-ops)

#### Mikl__

• Member
•    • Posts: 743 ##### Re: fsqrt vs rsqrtss
« Reply #4 on: December 31, 2017, 01:11:53 AM »
Hallo, Gunther!
Danke für den interessanten Artikel! Ich werde definitiv in Assembler-Sprache übersetzen und die empfangenen Programme auf Geschwindigkeit prüfen. Glückliches neues Jahr!

#### Gunther

• Member
•     • • Posts: 3585
• Forgive your enemies, but never forget their names ##### Re: fsqrt vs rsqrtss
« Reply #5 on: December 31, 2017, 01:35:52 AM »
Mikl__,

Желаю вам хорошего здоровья и счастья в следующем году.

Gunther
Get your facts first, and then you can distort them.

#### AW

• Member
•     • • Posts: 1969
• Let's Make ASM Great Again! ##### Re: fsqrt vs rsqrtss
« Reply #6 on: December 31, 2017, 03:06:00 AM »
JJ,
Why don't you mention SQRTSS or SQRTPS? Do you prefer reciprocals?  :shock:

#### jj2007

• Member
•     • • Posts: 9401
• Assembler is fun ;-) ##### Re: fsqrt vs rsqrtss
« Reply #7 on: December 31, 2017, 04:01:58 AM »
Why don't you mention SQRTSS or SQRTPS? Do you prefer reciprocals?  :shock:

Good idea:
Code: [Select]
`Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)5536    cycles for 100 * fsqrt4300    cycles for 100 * rsqrtss4638    cycles for 100 * sqrtss5639    cycles for 100 * sqrtps5532    cycles for 100 * fsqrt4301    cycles for 100 * rsqrtss4638    cycles for 100 * sqrtss5641    cycles for 100 * sqrtps5538    cycles for 100 * fsqrt4301    cycles for 100 * rsqrtss4640    cycles for 100 * sqrtss5674    cycles for 100 * sqrtps5555    cycles for 100 * fsqrt4304    cycles for 100 * rsqrtss4640    cycles for 100 * sqrtss5709    cycles for 100 * sqrtps5548    cycles for 100 * fsqrt4305    cycles for 100 * rsqrtss4641    cycles for 100 * sqrtss5675    cycles for 100 * sqrtps87      bytes for fsqrt97      bytes for rsqrtss89      bytes for sqrtss87      bytes for sqrtps9483.79199 = fsqrt9483.57227 = rsqrtss9483.79199 = sqrtss1.16594337e+09 = sqrtps`The last result looks wrong, though ::)

Code: [Select]
` Rand(0.0, 123456789.0, MyReal4) sqrtps xmm0, oword ptr MyReal4 movaps oword ptr MyReal4, xmm0`

#### Mikl__

• Member
•    • Posts: 743 ##### Re: fsqrt vs rsqrtss
« Reply #8 on: December 31, 2017, 11:42:10 AM »
Buon anno, jj! Code: [Select]
`Intel(R) Core(TM) i3-3210 CPU @ 3.20GHz (SSE4)6647 cycles for 100 * fsqrt5242 cycles for 100 * rsqrtss5549 cycles for 100 * sqrtss6665 cycles for 100 * sqrtps6685 cycles for 100 * fsqrt5245 cycles for 100 * rsqrtss5544 cycles for 100 * sqrtss6653 cycles for 100 * sqrtps6657 cycles for 100 * fsqrt5253 cycles for 100 * rsqrtss5562 cycles for 100 * sqrtss6653 cycles for 100 * sqrtps6775 cycles for 100 * fsqrt5310 cycles for 100 * rsqrtss5543 cycles for 100 * sqrtss6654 cycles for 100 * sqrtps6655 cycles for 100 * fsqrt5251 cycles for 100 * rsqrtss5550 cycles for 100 * sqrtss6647 cycles for 100 * sqrtps87 bytes for fsqrt97 bytes for rsqrtss89 bytes for sqrtss87 bytes for sqrtps9483.79199 = fsqrt9483.57227 = rsqrtss9483.79199 = sqrtss1.16594337e+09 = sqrtps`

#### jj2007

• Member
•     • • Posts: 9401
• Assembler is fun ;-) ##### Re: fsqrt vs rsqrtss
« Reply #9 on: December 31, 2017, 12:06:28 PM »
Buon anno a te, Mikl :icon14:

#### LordAdef ##### Re: fsqrt vs rsqrtss
« Reply #10 on: December 31, 2017, 02:32:15 PM »
Is "oword" a typo, a Johen's invention, or something I'm about to learn? #### felipe

• Member
•     • • Posts: 1164
• Eagles are just great! ##### Re: fsqrt vs rsqrtss
« Reply #11 on: December 31, 2017, 02:44:47 PM »
oword = 128 bits. Or 16 bytes. It's a data identifier like qword, dword, etc.
Felipe.

#### hutch--

• Administrator
• Member
•      • • Posts: 6279
• Mnemonic Driven API Grinder ##### Re: fsqrt vs rsqrtss
« Reply #12 on: December 31, 2017, 04:11:22 PM »
MASM likes XMMWORD and YMMWORD. I think an earlier version of 32 bit MASM used OWORD.
hutch at movsd dot com
http://www.masm32.com  #### LordAdef ##### Re: fsqrt vs rsqrtss
« Reply #13 on: December 31, 2017, 06:59:52 PM »
thanks Felipe and Hutch, this one is new to me!

#### daydreamer ##### Re: fsqrt vs rsqrtss
« Reply #14 on: December 31, 2017, 08:15:43 PM »
Ciao, jj2007!
There is soon a new year, so I'll a little flood It would seem that it could be simpler than fsqrt, but we'll go the other way
1) The best-known integer algorithm for calculating the square root of a number is striking in its simplicity
Code: [Select]
`unsigned sqrt_cpu_int(usigned X){ unsigned div = 1, result = 0;          while (X > 0)         { X -= div; div += 2; result += X < 0 ? 0 : 1; }          return result; }`The disadvantage of this algorithm is that the number of iterations will increase with increasing X
2) calculation of the square root of the Taylor series expansion.
Let X be any number; f(X) is a function depending on X; a is a known number close to X; f(a) is a known value of the function.
We expand f(X) in a Taylor series:
f(X)=f(a)+(X-a)f'(a)+((X-a)²∙f"(a))/2!+...+((X-a)n∙fn(a))/n!
Let X be the number from which to extract the root. Then f(X)=√X; a is a known number close to X; f(a)=√a is a known number close to √X, and f(X)=√a+(X-a)/(2√a)+...=(2a+X-a)/(2√a)=(a+X)/(2√a)
The value √X can be found by setting √a and then calculating f(X). f(X)² can be compared to the original number X. If the accuracy is insufficient, then the number a is replaced by f(X)², √a by f(X) and the calculation is repeated
3) the search for an integer square root by Newton's method begins with a certain value of g0, which is the initial estimate √X. Then a series of refinements of the value of the square root is performed by the formula gn+1 = (gn + X/gn)/2. To reduce the number of iterations, you can, in the first step, select the initial value for the variable g0
Code: [Select]
`{usigned x1;int a, g0, g1;if (x <= 1) return x;a = 1;x1 = x-1;if (x1> 65535) {a = a + 8; x1 = x1 >> 16;}if (x1> 255) {a = a + 4; x1 = x1 >> 8;}if (x1> 15) {a = a + 2; x1 = x1 >> 4;}if (x1> 3) {a = a + 1;}g0 = 1 << a; // g0 = 2 //g1 = (g0 + (x >> a)) >> 1;while (g1 <g0) // we repeat, while the approximations are strictly reduced{g0 = g1; g1 = (g0 + (x/g0)) >> 1}}`4) Since the division is a fairly slow operation, you can abandon it. To calculate the square root, we use multiplication. A 32-bit X number will have a 16-bit square root. At each step, the 1 bit of the root value is refined. For the acceleration, the calculation "without jumps" is made to add or subtract the value in the corresponding digit
Code: [Select]
`mov ebx, 4000h     mov ecx, 8000h     mov eax, 40000000h@@: cmp eax, X     je @f; premature exit from the cycle     sbb edx, edx; if CF = 1 then edx = 0FFFFFFFFh otherwise edx = 0     sub ecx, ebx     and edx, ebx; if CF = 1 then edx = ebx otherwise edx = 0     lea ecx, [ecx + edx * 2]; if CF = 1 then ecx = ecx + ebx otherwise ecx = ecx-ebx     shr ebx, 1; move to the next level     mov eax, ecx     mul eax; we get "eax squared"     test ebx, ebx     jnz @b@@: mov eax, ecx; eax: = sqrt (X)`It is not clear how, neither division, nor multiplication, but it works! And the algorithm was described already in 1945 in Von Neumann J. "First Draft of a Reaport on the EDVAC"
Code: [Select]
`    mov ebp,X    mov ecx,40000000h    xor eax,eax@@: mov ebx,eax    or ebx,ecx    shr eax,1    cmp ebp,ebx    sbb edx,edx    not edx    mov edi,edx    and edx,ebx    and edi,ecx    sub ebp,edx    or eax,edi    shr ecx,2    test ecx,ecx    jnz @b;eax=sqrt(X)`Or
Code: [Select]
`mov ebp, X         inc ebp     bsr ecx, ebp     and ecx, -2     mov ebx, 1     shl ebx, cl; to reduce the number of iterations         xor eax, eax@@: lea ecx, [eax + ebx]     shr eax, 1     cmp ecx, ebp     sbb edx, edx     mov edi, edx     and edx, ecx     sub ebp, edx     and edi, ebx     or eax, edi     shr ebx, 2     jnz @b`maybe these algorithms will be faster than fsqrt with 5703 cycles for 100
would be interesting to code it with help of SSE2 integer,think you could call different versions of same algo depending on size,,Think the speed if all variables are byte sized in a 128bit XMM regs
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*