Square root

Farabi · January 03, 2013, 03:42:37 PM

Ah, the reciprocal square root can make things faster for the direction vector.

If I had a floating point A, and I want to copy that A value to all 32-bit slots on SSE register, what instruction shoudl I use?

Code Select



movd xmm0,eax
shuffd xmm0,xmm0,0

??

FORTRANS · January 04, 2013, 12:50:10 AM

Quote from: dedndave on January 03, 2013, 10:26:12 AM
it isn't hard
you deal with 2 input digits and one root digit at a time
then, basically a trial and error thing :P
if the 2 digits are 20, you know the root digit is between 4 and 5 - use 4
then do the next 2 input digits

Hi,

Well someting like that, but had a set of rules for the
fracional part. You could get four to six digits fairly easily

Regards,

Steve N.

FORTRANS · January 04, 2013, 01:09:30 AM

Quote from: KeepingRealBusy on January 03, 2013, 02:13:49 PM
I have the solution for you if you want it, along with several other methods I dreamed up and timed. I was doing the square root of BIG NUMS (up to 64 KB).

OBTW, the paper and pencil method came from a 1930's electronics text book my Dad had.

Dave.

Hi,

Sure, I would be interested in the method. Though my large
math code is fixed-point. And smaller (so far) than 64 KB.
A couple of hundred digits before losing interest and going
back to the FPU. Might be nice to have a square root.

The manual paper and pencil method I found was at a "Mr.
Math" site. It pairs digits and performs a long division type
of algorithm. And a bit involved to do in my head. Though
either similar or the same as what I was doing, it 'seems'
more complicated than what I vaugely remember.

Cheers,

Steve N.

dedndave · January 04, 2013, 01:50:41 AM

it's the same thing in binary, except you work on 2 input binary digits and 1 root binary digit at a time

here is an old 16-bit squarerooter of mine from the 80's

it was written for the 8088 instruction set, so it could be amped up a bit

Code Select

;-----------------------------------------------
;
SQINT   PROC    NEAR
;
;  This is a 24-bit integer squareroot routine.
; It calculates the squareoroot with 24-bit
; accuracy, regardless of the size of the input
; value, then rounds it to the nearest integer.
;
; Call With: DL:AX = 24 bit integer (0-FF:FFFFh)
;
;   Returns: AX = rounded squareroot (0-1000h)
;            BX = DX = CH = 0
;            CF = clear
;            CL, BP, SI, DI invalid
;
        MOV     DH,DL
        MOV     DL,AH
        MOV     BH,AL
        MOV     AX,0C000h
        MOV     BL,AL
        MOV     CX,8
;
SQINT0: TEST    AX,DX
        JNZ     SQINT3
;
        ROR     AX,1
        ROR     AX,1
        LOOP    SQINT0
;
        MOV     CL,4
;
SQINT1: TEST    AH,BH
        JNZ     SQINT2
;
        SHR     AH,1
        SHR     AH,1
        LOOP    SQINT1
;
        RET
;
SQINT2: SUB     CL,4
;
SQINT3: SHL     CL,1
        ADD     CL,9
        XOR     AX,AX
        MOV     SI,AX
        MOV     DI,AX
        MOV     BP,AX
;
SQINT4: SHL     BX,1
        RCL     DX,1
        RCL     DI,1
        RCL     BP,1
        SHL     BX,1
        RCL     DX,1
        RCL     DI,1
        RCL     BP,1
        STC
        RCL     SI,1
        RCL     AX,1
        SUB     DI,SI
        SBB     BP,AX
        JNC     SQINT6
;
        ADD     DI,SI
        ADC     BP,AX
        DEC     SI
        LOOP    SQINT4
;
SQINT5: SHL     SI,1
        RCL     AX,1
        SHL     SI,1
        RCL     AX,1
        SHL     SI,1
        ADC     AX,CX
        RET
;
SQINT6: INC     SI
        LOOP    SQINT4
;
        JMP     SQINT5
;
SQINT   ENDP
;
;-----------------------------------------------

FORTRANS · January 04, 2013, 04:24:10 AM

Hi Dave,

Yes, thanks. You posted it in the old forum. Without
comments, it's a bit hard to see what is going on. Though,
IIRC, I tested it back when you posted it. And, I tend to
wonder what an interger square root is used for. You (I)
see a lot of code to do them, so it should be useful in low
level graphics or ray tracing. What did you use it for?

At one time I started a square root for my fixed point code.
I think I got as far as a blank stub with an algorithm in the
comments. Oh well.

Cheers,

Steve

dedndave · January 04, 2013, 04:39:01 AM

it would be much easier to understand if i had used the pentium instruction set :P
that's because, on the 8088, we didn't have SHRD, SHLD
if you wanted to ripple a bit through 4 word registers...

Code Select

        SHL     BX,1
        RCL     DX,1
        RCL     DI,1
        RCL     BP,1

you can see how much easier that would be to read using the newer instructions

the routine does very much what you did in high-school :P
except, it's binary, of course
it tests 2 input bits, sets an accumulator bit, tests it (and adjusts as reqd), then shifts left and does the next
i was limited to 24 bits because i ran out of registers - lol

the main use for it is graphics, as you suggested
i think i wrote it for a bresenham-like circle routine
but, it is also useful for estimating larger square roots
well - it was back then

nowdays, it's only real use is to demonstrate the algorithm
i use it as a test to see if i can assemble 16-bit code :lol:
i have my machine set up so i can use masm 6.15 or masm 5.10 to build 16-bit programs

KeepingRealBusy · January 04, 2013, 05:00:39 AM

To all,

Here are the comments for the methods, the code gets involved with my exact implementation and you would need the whole library:

Code Select


;-------------------------------------------------------------------------------
;
;   Extract the square root from a LONG_NUMB value.
;
;   Supply the LONG_NUMB OFFSET for the source number, the LONG_NUMB OFFSET for
;   the SquareRoot, and the LONG_NUMB OFFSET for the remainder.
;
;   Returns status flags.
;
;-------------------------------------------------------------------------------
Comment~

one way to do it:

square = (a+b)*(a+b) = (a**2 + 2ab + b**2)
a = sqrt(max power in square) = 2**((max bits in square - 1) / 2)
rem = (square - a**2)
b = sqrt(max power in rem) = 2**((max bits in rem - 1) / 2) but b must be < a, if not, then decrement b
iterate
rem = (rem - 2ab - b**2) but rem must be >= 0, if not, then decrement b and iterate
set a = (a + b)
if (rem > 0) and (b > 0) then iterate
set root = a, leave rem

example calculation:

sqr = (2**19 + 2**18 + 2**17)
a = 2**(19/2) = 2**9
a**2 = 2**18
rem = (2**19 + 2**18 + 2**17 - 2**18) = (2**19 + 2**17)
a = 2**9
b = 2**(19/2)
b = 2**9 but b must be < a so b = 2**8
2ab = 2**10*2**8 = 2**18
b**2 = 2**16
rem = (rem - 2ab - b**2) = (2**19 + 2**17 - 2**18 - 2**16) = (2**18 + 2**18 + 2**16 + 2**16 - 2**18 - 2**16)
rem = (2**18 + 2**16)
a = (a + b) = (2**9 + 2**8)
b = 2**(18/2) = 2**9 but b must be < last so b = 2**7
2ab = ((2**10 + 2**9) * 2**7) = (2**17 + 2**16)
b**2 = 2**14
rem = (rem - 2ab - b**2) = (2**18 + 2**16 - 2**17 - 2**16 - 2**14) = (2**17 + 2**17 + 2**16 - 2**17 - 2**16 - 2**14)
rem = (2**17 - 2**14) = (2**16 + 2**15 + 2**14 + 2**14 - 2**14)
rem = (2**16 + 2**15 + 2**14 )
a = (a + b) = (2**9 + 2**8 + 2**7)
b = 2**(16/2) = 2**8 but b must be < last so b = 2**6
2ab = ((2**10 + 2**9 + 2**8) * 2**6) = (2**16 + 2**15 + 2**14)
b**2 = 2**12
rem = (rem - 2ab - b**2) = (2**16 + 2**15 + 2**14 - 2**16 - 2**15 - 2**14 - 2**12)
rem = (- 2**12) but rem must be > 0 so b = 2**5
rem = (2**16 + 2**15 + 2**14 )
a = (a + b) = (2**9 + 2**8 + 2**7)
b = 2**5
2ab = ((2**10 + 2**9 + 2**8) * 2**5) = (2**15 + 2**14 + 2**13)
b**2 = 2**10
rem = (rem - 2ab - b**2) = (2**16 + 2**15 + 2**14 - 2**15 - 2**14 - 2**13 - 2**10)
rem = (2**16 - 2**13 - 2**10) = (2**15 + 2**14 + 2**13 + 2**13 - 2**13 - 2**10) = (2**15 + 2**14 + 2**13 - 2**10)
rem = (2**15 + 2**14 + 2**12 + 2**11 + 2**10 + 2**10 - 2**10)
rem = (2**15 + 2**14 + 2**12 + 2**11 + 2**10)
a = (a + b) = (2**9 + 2**8 + 2**7 + 2**5)
b = 2**(15/2) = 2**7 but b must be < last so b = 2**4
2ab = ((2**10 + 2**9 + 2**8 + 2**6) * 2**4) = (2**14 + 2**13 + 2**12 + 2**10)
b**2 = 2**8
rem = (rem - 2ab - b**2) = (2**15 + 2**14 + 2**12 + 2**11 + 2**10 - 2**14 - 2**13 - 2**12 - 2**10 - 2**8)
rem = (2**15 + 2**11 - 2**13 - 2**8) = (2**14 + 2**13 + 2**13 + 2**11 - 2**13 - 2**8)
rem = (2**14 + 2**13 + 2**11 - 2**8) = (2**14 + 2**13 + 2**10 + 2**9 + 2**8 + 2**8 - 2**8)
rem = = (2**14 + 2**13 + 2**10 + 2**9 + 2**8)
a = (a + b) = (2**9 + 2**8 + 2**7 + 2**5 + 2**4)
b = 2**(14/2) = 2**7 but b must be < last so b = 2**3
2ab = ((2**10 + 2**9 + 2**8 + 2**6 + 2**5) * 2**3) = (2**13 + 2**12 + 2**11 + 2**9 + 2**8)
b**2 = 2**6
rem = (rem - 2ab - b**2)
rem = (2**14 + 2**13 + 2**10 + 2**9 + 2**8 - 2**13 - 2**12 - 2**11 - 2**9 - 2**8 - 2**6)
rem = (2**14  + 2**10 - 2**12 - 2**11 - 2**6) = (2**13 + 2**12 + 2**12 + 2**10 - 2**12 - 2**11 - 2**6)
rem = (2**13 + 2**12 + 2**10 - 2**11 - 2**6) = (2**13 + 2**11 + 2**11 + 2**10 - 2**11 - 2**6)
rem = (2**13 + 2**11 + 2**10 - 2**6) = (2**13 + 2**11 + 2**9 + 2**8 + 2**7 + 2**6 + 2**6 - 2**6)
rem = (2**13 + 2**11 + 2**9 + 2**8 + 2**7 + 2**6)
a = (a + b) = (2**9 + 2**8 + 2**7 + 2**5 + 2**4 + 2**3)
b = 2**(13/2) = 2**6 but b must be < last so b = 2**2
2ab = ((2**10 + 2**9 + 2**8 + 2**6 + 2**5 + 2**4) * 2**2) = (2**12 + 2**11 + 2**10 + 2**8 + 2**7 + 2**6)
b**2 = 2**4

rem = (rem - 2ab - b**2) = ((2**13 + 2**11 + 2**9 + 2**8 + 2**7 + 2**6) -
                            (2**12 + 2**11 + 2**10 + 2**8 + 2**7 + 2**6) - 2**4)
rem = (2**13 + 2**9 - 2**12 - 2**10 - 2**4) = (2**12 + 2**11 + 2**10 + 2**10 + 2**9 - 2**12 - 2**10 - 2**4)
rem = (2**11 + 2**10 + 2**9 - 2**4) = (2**11 + 2**10 + 2**8 + 2**7 + 2**6 + 2**5 + 2**4 + 2**4 - 2**4)
rem = (2**11 + 2**10 + 2**8 + 2**7 + 2**6 + 2**5 + 2**4)
a = (a + b) = (2**9 + 2**8 + 2**7 + 2**5 + 2**4 + 2**3 + 2**2)
b = 2**(11/2) = 2**5 but b must be < last so b = 2**1
2ab = ((2**10 + 2**9 + 2**8 + 2**6 + 2**5 + 2**4 + 2**3) * 2**1)
2ab = (2**11 + 2**10 + 2**9 + 2**7 + 2**6 + 2**5 + 2**4)
b**2 = 2**2
rem = (rem - 2ab - b**2) = ((2**11 + 2**10 + 2**8 + 2**7 + 2**6 + 2**5 + 2**4) -
                                                        (2**11 + 2**10 + 2**9 + 2**7 + 2**6 + 2**5 + 2**4) - 2**2)
rem = (2**11 + 2**10 + 2**8 + 2**7 + 2**6 + 2**5 + 2**4 -
                                                        2**11 - 2**10 - 2**9 - 2**7 - 2**6 - 2**5 - 2**4 - 2**2)
rem = (2**8 - 2**9 - 2**2) = (2**8 - 2**8 - 2**8 - 2**2) = (- 2**8 - 2**2)
but rem Must be > 0 so b = 2**0
rem = (2**11 + 2**10 + 2**8 + 2**7 + 2**6 + 2**5 + 2**4)
a = (a + b) = (2**9 + 2**8 + 2**7 + 2**5 + 2**4 + 2**3 + 2**2)
b = 2**0
2ab = ((2**10 + 2**9 + 2**8 + 2**6 + 2**5 + 2**4 + 2**3) * 2**0)
2ab = (2**10 + 2**9 + 2**8 + 2**6 + 2**5 + 2**4 + 2**3)
b**2 = 2**0
rem = (rem - 2ab - b**2) = ((2**11 + 2**10 + 2**8 + 2**7 + 2**6 + 2**5 + 2**4) -
                            (2**10 + 2**9 + 2**8 + 2**6 + 2**5 + 2**4 + 2**3) - 2**0)
rem = (2**11 + 2**10 + 2**8 + 2**7 + 2**6 + 2**5 + 2**4 -
                            2**10 - 2**9 - 2**8 - 2**6 - 2**5 - 2**4 - 2**3 - 2**0)
rem = (2**11 + 2**7 - 2**9 - 2**3 - 2**0) = (2**10 + 2**9 + 2**9 + 2**7 - 2**9 - 2**3 - 2**0)
rem = (2**10 + 2**9 + 2**7 - 2**3 - 2**0)
rem = (2**10 + 2**9 + 2**6 + 2**5 + 2**4 + 2**3 + 2**2 + 2**1 + 2**0 + 2**0 - 2**3 - 2**0)
rem = (2**10 + 2**9 + 2**6 + 2**5 + 2**4 + 2**2 + 2**1 + 2**0)
a = (a + b) = (2**9 + 2**8 + 2**7 + 2**5 + 2**4 + 2**3 + 2**2 + 2**0)
b = 0, no more b changes


(2**19 + 2**18 + 2**17) = 917504
(2**10 + 2**9 + 2**6 + 2**5 + 2**4 + 2**2 + 2**1 + 2**0) = 1655
(2**9 + 2**8 + 2**7 + 2**5 + 2**4 + 2**3 + 2**2 + 2**0) = 957
sqrt(917504) = 957 + 1655
957**2 = 915849
958**2 = 917764 (too big)

Start with rem = square and a = 0 and (last b) = (bits in rem)
iterate:
double a
b = (greatest power of 2 in rem) = (bits in rem - 1)
try:
if b >= (last b) then set b = ((last b) - 1)
set (last b) = b
b**2 = ((bits in rem - 1) * 2)
trial = double a + b**2
if trial > rem decrement b and try
rem = (rem - trial) iterate until b = 0

All calculations are either a double of a or adding a power of 2 to a or subtracting the result from rem.

Another way (the old mathematics way with a computer age twist - the digits have 4 gigabytes of values - 32 bits,
not 10 values, 4 bits). This is loosely based on the binary expansion of (a + b + c)**2 where a, b, and c are
digits in a composite number such that a = a*base**2, b = b*base**1, and c = c*base**0:

With any multiplication, the most than can result from two digits is a double digit (sum of the bits)

(a + b + c) * (a + b + c) =
a**2 + 2a*b + b**2 + 2b*c + c**2 where each successive column is scaled by one more power of the base than the following
            + 2a*c

i.e. a**2*base**4 + 2a*b*base**3 + b**2*base**2 + 2b*c*base**1 + c**2*base**0
                                 + 2a*c*base**2

these values are each actually two digits and the digits overlap,

i.e. de*base**4 + fg*base**3 + hi*base**2 + jk*base**1 + lm*base**0
                             + no*base**2

i.e. d*base**5 + e*base**4
               + f*base**4 + g*base**3
                           + h*base**3 + i*base**2
                           + n*base**3 + o*base**2
                                       + j*base**2 + k*base**1
                                                   + l*base**1 + m*base**0

or splitting the digits and simply putting them in columns:

 d e
   f g
     h i
     n o
       j k
         l m

consider a long integer ab where a and b are 32 bit unsigned integers, The
square of the  long integer would be calculated as:

      a b                              7 9          9 9       F F
    x a b                              7 9          9 9       F F
      ---                              ---          ---       ---
      c d b**2                         8 1          8 1       E 1
    e f   bxa                        6 3          8 1       E 1
    e f   axb                        6 3          8 1       E 1
  g h     a**2                     4 9          8 1       E 1
  -------                          -------      -------   -------
  i j k d (a b)**2                 6 2 4 1      9 8 0 1   F E 0 1
 +    l m remainder < (ab)                              +     F E
---------                                                 -------
  n o p q value                                           F E F F
- g h     (sqrt(n o))**2 (a)                            - E 1     (SQRT(F E))**2
  -------                                                 -------
  r s p q                                                 1 D F F

find sqrt(n o) and use as first digit of root
square the first digit and subtract from (n o) giving (r s)
DOALL:
bring down next digit (p) giving (r s p)
double the root and divide (r s p) by the double giving next trial unit digit t
bring down next digit (q) giving (r s p q)
bring down a 0 for the root giving a0
TRYIT:
form (2a0*t+t**2) = ((2a0+t)*t)
if ((2a0+t)*t) > (r s p q) decrement (t) and goto TRYIT

Now, t must be a single digit because if it was a composite of two digits, then it would cause an
incrementing of the next larger most significant digit, but this has already been determined to
be the largest possible digit in that position, so if this happens, then the trial digit should
be set to a single digit with a maximum value and decrement from there until valid.

Actually, the decrementing will be done by binary division from the first trial
and a zero value until the largest value below the limit is found otherwise you
could be in for 400,000,000 trials (4 gigabytes / 10).

subtract ((2a0+t)*t) from (r s p q)
form (root=root*base+t)
continue at DOALL for all digit pairs

Try with 567 squared + 11 remainder

   567
  x567
------
321489
   +11
------
321500


32**(1/2)=5
(32-(5**2))=(32-25)=7
(71/(2*5))=(71/10)=7
(715-(2*50*7+7**2))=(715-(700+49))=(715-749)=-34
use 6 instead
(715-(2*50*6+6**2))=(715-(600+36))=(715-636)=79
(790/(2*56))=(790/112)=7
(7900-(2*560*7+7**2))=(7900-(1120*7+49))=(7900-(7840+49))=(7900-7889)=11
square root = 567
remainder = 11

~

I have somehow lost my Dad's Audel's math for engineers book (1932 as I remember) so I can't exactly quote the method given there, but this method (the math method not the binary method) was extracted from that book in 2003/2004.

Dave

raymond · January 04, 2013, 06:52:26 AM

Dave,
You may want to compare your implementation for binary square roots with the sqrt64 (or sqrt32) functions provided in the MixLib library which you can get from
http://www.ray.masmcode.com/fixmath.html
The download contains all the source code for the functions.

If anyone is interested, those functions could always be extended to whatever input size is required.

dedndave · January 04, 2013, 06:58:39 AM

hiya Ray
it looks like a similar algorithm - pentium instruction set, of course
and - you unrolled it for speed :P

Gunther · January 04, 2013, 07:15:16 AM

Hi Raymond,

Quote from: raymond on January 04, 2013, 06:52:26 AM
Dave,
You may want to compare your implementation for binary square roots with the sqrt64 (or sqrt32) functions provided in the MixLib library which you can get from
http://www.ray.masmcode.com/fixmath.html
The download contains all the source code for the functions.

If anyone is interested, those functions could always be extended to whatever input size is required.

rock solid work. :t Thank you for the link.

Gunther

raymond · January 04, 2013, 07:21:08 AM

Quote from: dedndave on January 04, 2013, 06:58:39 AM
hiya Ray
it looks like a similar algorithm - pentium instruction set, of course
and - you unrolled it for speed :P

I didn't go to the trouble of analyzing his procedure in depth but I wouldn't be surprised if it would be similar.
I developed mine many years ago somewhat by accident when I tried to adapt my "abacus" procedure for extracting square roots of decimal numbers to binary numbers; and it worked!!!

dedndave · January 04, 2013, 07:24:14 AM

was that on your commodore 64 ?

raymond · January 04, 2013, 08:47:44 AM

The first "abacus" algo for decimals was on my TRS-80 with 64k total memory, computing square roots with up to 9999 decimal places!

It's only later on with the 8088 that I started wondering if a similar approach could also be applied to binary numbers. I was elated when I found that it did work, without ever really understanding why it did. :redface:

The MASM Forum

News:

Square root

Farabi

FORTRANS

FORTRANS

dedndave

FORTRANS

dedndave

KeepingRealBusy

raymond

dedndave

Gunther

raymond

dedndave

raymond