fsqrt vs rsqrtss

Started by jj2007, December 30, 2017, 09:25:57 PM


jj2007

Category "almost unknown SIMD instructions" ;)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

5703    cycles for 100 * fsqrt
4380    cycles for 100 * rsqrtss

5684    cycles for 100 * fsqrt
4436    cycles for 100 * rsqrtss

5683    cycles for 100 * fsqrt
4435    cycles for 100 * rsqrtss

5646    cycles for 100 * fsqrt
4420    cycles for 100 * rsqrtss

5699    cycles for 100 * fsqrt
4396    cycles for 100 * rsqrtss

87      bytes for fsqrt
97      bytes for rsqrtss

9483.79199 = fsqrt
9483.57227 = rsqrtss


RSQRTSS--Scalar Single-Precision Floating-Point Square Root Reciprocal
Computes an approximate reciprocal of the square root of the low single-precision floating-point value in the source operand (second operand) and stores the single-precision floating-point result in the destination operand. The maximum error for this approximation is (1.5 * 2⁻¹²). The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register.
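
For illustration, here is a small intrinsics sketch (mine, not from the thread) of the usual way rsqrtss is turned into a square root: approximate 1/√x, refine it with one Newton-Raphson step, then multiply by x. The function name is made up; 0.5 and 1.5 are the standard refinement coefficients.

#include <xmmintrin.h>   /* SSE intrinsics: _mm_rsqrt_ss, _mm_mul_ss, ... */

/* Approximate sqrt(x) as x * rsqrt(x); assumes x > 0. One Newton-Raphson step
   on the reciprocal square root reduces the ~1.5*2^-12 error to near float precision. */
float sqrt_via_rsqrtss(float x)
{
    __m128 vx = _mm_set_ss(x);
    __m128 y  = _mm_rsqrt_ss(vx);                  /* y ~= 1/sqrt(x) */
    __m128 halfx = _mm_mul_ss(_mm_set_ss(0.5f), vx);
    __m128 y2    = _mm_mul_ss(y, y);
    y = _mm_mul_ss(y, _mm_sub_ss(_mm_set_ss(1.5f), _mm_mul_ss(halfx, y2)));  /* y = y*(1.5 - 0.5*x*y*y) */
    return x * _mm_cvtss_f32(y);                   /* sqrt(x) = x * (1/sqrt(x)) */
}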

Mikl__

#1
Ciao, jj2007!
The new year is almost here, so I'll flood a little.
It would seem that nothing could be simpler than fsqrt, but we'll go the other way.
1) The best-known integer algorithm for calculating the square root of a number is striking in its simplicity:
unsigned sqrt_cpu_int(unsigned X)
{ unsigned div = 1, result = 0;
         while (X >= div)                 /* subtract successive odd numbers 1, 3, 5, ... */
         { X -= div; div += 2; result++; }
         return result;
}
The disadvantage of this algorithm is that the number of iterations grows with X (it is roughly proportional to √X).
2) Calculation of the square root via a Taylor series expansion.
Let X be any number; f(X) is a function depending on X; a is a known number close to X; f(a) is a known value of the function.
We expand f(X) in a Taylor series:
f(X)=f(a)+(X-a)f'(a)+((X-a)²∙f"(a))/2!+...+((X-a)ⁿ∙f⁽ⁿ⁾(a))/n!
Let X be the number from which to extract the root. Then f(X)=√X; a is a known number close to X; f(a)=√a is a known number close to √X, and f(X)=√a+(X-a)/(2√a)+...=(2a+X-a)/(2√a)=(a+X)/(2√a)
The value √X can be found by choosing √a and then calculating f(X). f(X)² can then be compared to the original number X. If the accuracy is insufficient, the number a is replaced by f(X)², √a by f(X), and the calculation is repeated.
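A minimal C sketch of this iteration (my sketch; the starting estimate and the relative-error tolerance eps, e.g. 1e-12, are assumptions, not part of the post):
#include <math.h>   /* fabs, used only for the convergence test */
/* One Taylor/Heron refinement per loop: next = (a + X) / (2*sqrt(a)),
   where r is the current estimate of sqrt(X) and a = r*r. */
double sqrt_taylor(double X, double eps)
{
    if (X <= 0.0) return 0.0;               /* no positive root to find */
    double r = (X > 1.0) ? X / 2.0 : 1.0;   /* rough starting estimate */
    double a = r * r;
    while (fabs(a - X) > eps * X) {
        r = (a + X) / (2.0 * r);            /* f(X) = (a + X) / (2*sqrt(a)) */
        a = r * r;                          /* replace a by f(X)^2 */
    }
    return r;
}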
3) The search for an integer square root by Newton's method begins with a value g₀, the initial estimate of √X. A series of refinements of the square root is then performed by the formula gₙ₊₁ = (gₙ + X/gₙ)/2. To reduce the number of iterations, you can first choose the initial value g₀ from the bit length of X:
unsigned sqrt_newton(unsigned x)
{ unsigned x1;
  int a, g0, g1;
  if (x <= 1) return x;
  a = 1;
  x1 = x - 1;
  if (x1 > 65535) { a = a + 8; x1 = x1 >> 16; }
  if (x1 > 255)   { a = a + 4; x1 = x1 >> 8; }
  if (x1 > 15)    { a = a + 2; x1 = x1 >> 4; }
  if (x1 > 3)     { a = a + 1; }
  g0 = 1 << a;                     // g0 = 2^a, first estimate of sqrt(x)
  g1 = (g0 + (x >> a)) >> 1;
  while (g1 < g0)                  // repeat while the approximations strictly decrease
  { g0 = g1; g1 = (g0 + (x / g0)) >> 1; }
  return g0;
}
4) Since division is a fairly slow operation, you can abandon it and compute the square root using multiplication instead. A 32-bit number X has a 16-bit square root, and at each step one bit of the root is refined. For speed, the decision to add or subtract the value of the corresponding bit is made "without jumps":
     mov ebx, 4000h
     mov ecx, 8000h
     mov eax, 40000000h
@@:  cmp eax, X
     je @f                    ; early exit from the loop
     sbb edx, edx             ; if CF = 1 then edx = 0FFFFFFFFh, otherwise edx = 0
     sub ecx, ebx
     and edx, ebx             ; if CF = 1 then edx = ebx, otherwise edx = 0
     lea ecx, [ecx + edx * 2] ; if CF = 1 then ecx = ecx + ebx, otherwise ecx = ecx - ebx
     shr ebx, 1               ; move on to the next bit
     mov eax, ecx
     mul eax                  ; eax = ecx squared (fits in 32 bits)
     test ebx, ebx
     jnz @b
@@:  mov eax, ecx             ; eax = sqrt(X)
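A rough C counterpart of this idea (my sketch, not a line-by-line translation of the code above; the function name is made up):
/* Refine one bit of the 16-bit root per iteration; the only arithmetic
   on the candidate root is a squaring, no division anywhere. */
unsigned isqrt_by_squaring(unsigned x)
{
    unsigned r = 0;
    for (unsigned bit = 0x8000; bit; bit >>= 1) {
        unsigned t = r | bit;        /* tentatively set this bit of the root */
        if (t * t <= x)              /* keep it if the square still fits in x */
            r = t;
    }
    return r;                        /* r = floor(sqrt(x)) for 32-bit x */
}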
It is not clear how it can work with neither division nor multiplication, but it does! And the algorithm was already described in 1945 in John von Neumann's "First Draft of a Report on the EDVAC".
    mov ebp,X
    mov ecx,40000000h
    xor eax,eax
@@: mov ebx,eax
    or ebx,ecx
    shr eax,1
    cmp ebp,ebx
    sbb edx,edx
    not edx
    mov edi,edx
    and edx,ebx
    and edi,ecx
    sub ebp,edx
    or eax,edi
    shr ecx,2
    test ecx,ecx
    jnz @b
;eax=sqrt(X)
Or:
     mov ebp, X
     inc ebp
     bsr ecx, ebp
     and ecx, -2
     mov ebx, 1
     shl ebx, cl; to reduce the number of iterations
         xor eax, eax
@@: lea ecx, [eax + ebx]
     shr eax, 1
     cmp ecx, ebp
     sbb edx, edx
     mov edi, edx
     and edx, ecx
     sub ebp, edx
     and edi, ebx
     or eax, edi
     shr ebx, 2
     jnz @b
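The same shift-and-subtract idea is often written in C like this (a sketch of mine, not a translation of the assembly above):
/* Integer square root using only shifts, additions and subtractions. */
unsigned isqrt_bits(unsigned x)
{
    unsigned res = 0;
    unsigned bit = 1u << 30;          /* highest power of four representable in 32 bits */
    while (bit > x) bit >>= 2;        /* start at the highest power of four <= x */
    while (bit) {
        if (x >= res + bit) {
            x  -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;                       /* res = floor(sqrt(original x)) */
}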
Maybe these algorithms will be faster than fsqrt with its 5703 cycles for 100 iterations.

Gunther

Hi Mikl__,

You have set out a very good overview of possible alternative algorithms.  :t Here you will find an interesting article on these questions. It is done in a high-level language, but it should not be a problem to implement it in assembly language.

Gunther
You have to know the facts before you can distort them.

jj2007

If you come up with a convincing MASM algo, I can add it :bgrin:

(I doubt it will be any faster - if it's a good algo, the Intel guys realised it in micro-ops)

Mikl__

Hello, Gunther!
Thanks for the interesting article! I will definitely translate it into assembly language and check the resulting programs for speed. Happy New Year!

Gunther

Mikl__,

I wish you good health and happiness in the coming year.

Gunther
You have to know the facts before you can distort them.

aw27

JJ,
Why don't you mention SQRTSS or SQRTPS? Do you prefer reciprocals?  :shock:

jj2007

Quote from: aw27 on December 31, 2017, 03:06:00 AMWhy don't you mention SQRTSS or SQRTPS? Do you prefer reciprocals?  :shock:

Good idea:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

5536    cycles for 100 * fsqrt
4300    cycles for 100 * rsqrtss
4638    cycles for 100 * sqrtss
5639    cycles for 100 * sqrtps

5532    cycles for 100 * fsqrt
4301    cycles for 100 * rsqrtss
4638    cycles for 100 * sqrtss
5641    cycles for 100 * sqrtps

5538    cycles for 100 * fsqrt
4301    cycles for 100 * rsqrtss
4640    cycles for 100 * sqrtss
5674    cycles for 100 * sqrtps

5555    cycles for 100 * fsqrt
4304    cycles for 100 * rsqrtss
4640    cycles for 100 * sqrtss
5709    cycles for 100 * sqrtps

5548    cycles for 100 * fsqrt
4305    cycles for 100 * rsqrtss
4641    cycles for 100 * sqrtss
5675    cycles for 100 * sqrtps

87      bytes for fsqrt
97      bytes for rsqrtss
89      bytes for sqrtss
87      bytes for sqrtps

9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps

The last result looks wrong, though ::)

Rand(0.0, 123456789.0, MyReal4)
sqrtps xmm0, oword ptr MyReal4
movaps oword ptr MyReal4, xmm0
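
For comparison, a small C sketch (mine) that applies sqrtps via _mm_sqrt_ps to four explicitly initialized floats, just to illustrate that all four lanes of the 16-byte operand get a square root; the values are arbitrary:

#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    float vals[4] = { 2.0f, 9.0f, 16.0f, 123456789.0f };
    __m128 v = _mm_loadu_ps(vals);    /* load all four floats */
    v = _mm_sqrt_ps(v);               /* sqrtps: square root of every lane */
    _mm_storeu_ps(vals, v);
    printf("%f %f %f %f\n", vals[0], vals[1], vals[2], vals[3]);
    return 0;
}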

Mikl__

Happy New Year, jj!
Intel(R) Core(TM) i3-3210 CPU @ 3.20GHz (SSE4)

6647 cycles for 100 * fsqrt
5242 cycles for 100 * rsqrtss
5549 cycles for 100 * sqrtss
6665 cycles for 100 * sqrtps

6685 cycles for 100 * fsqrt
5245 cycles for 100 * rsqrtss
5544 cycles for 100 * sqrtss
6653 cycles for 100 * sqrtps

6657 cycles for 100 * fsqrt
5253 cycles for 100 * rsqrtss
5562 cycles for 100 * sqrtss
6653 cycles for 100 * sqrtps

6775 cycles for 100 * fsqrt
5310 cycles for 100 * rsqrtss
5543 cycles for 100 * sqrtss
6654 cycles for 100 * sqrtps

6655 cycles for 100 * fsqrt
5251 cycles for 100 * rsqrtss
5550 cycles for 100 * sqrtss
6647 cycles for 100 * sqrtps

87 bytes for fsqrt
97 bytes for rsqrtss
89 bytes for sqrtss
87 bytes for sqrtps

9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps

jj2007

Happy New Year to you too, Mikl :icon14:

LordAdef

Is "oword" a typo, a Johen's invention, or something I'm about to learn? :dazzled:

felipe

oword = 128 bits, or 16 bytes. It's a data type specifier like qword, dword, etc.

hutch--

MASM likes XMMWORD and YMMWORD. I think an earlier version of 32 bit MASM used OWORD.

LordAdef

thanks Felipe and Hutch, this one is new to me!

daydreamer

Quote from: Mikl__ on December 30, 2017, 10:47:53 PM
It would be interesting to code it with the help of SSE2 integer instructions. I think you could call different versions of the same algo depending on operand size. Think of the speed if all the variables were byte-sized in a 128-bit XMM register.
my non-asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding