Some testing has been performed lately to compare the speed of different procedures to extract square roots. Since one of the purposes was to search for prime numbers, and prime numbers are integers, this prompted me to revive an old procedure to extract the 'integer' square root of 'integer' numbers with 'integer' code instructions, and compare its speed to the general use of the FPU.
Although it may be three times slower than the FPU for the largest 31-bit integers, some may be interested in seeing this simple algo. It uses only the EAX, ECX and EDX registers (and it could be easily adapted to the 64-bit system by using the three equivalent 64-bit registers, as long as the bsr and bts instructions are still in use).
This is an adaptation of a procedure which was described in an IBM user's manual for their mechanical calculator marketed in the mid-50s. Not being a mathematician, I don't know where they may have obtained such a procedure nor can I explain why it works. I would assume that it may have been developed by the ancients to use with an "abacus".
;this "getsqr" procedure would be called with the target integer already in the EAX register
;the programmer would be responsible to check it's validity
;'negative' integers would simply be handled as positive 32-bit integers
getsqr:
bsr edx,eax ;get position number of most significant digit in EAX
xor ecx,ecx
bts ecx,edx ;ECX to be used for trailing bit position
shr edx,1 ;check if position was in even or odd number bit
.if CARRY? ;if EDX was odd, the target has even number of significant bits
shr ecx,1 ;shift ECX, it must start at the next even position
.endif
xor edx,edx ;EDX to be used for sqr
sq1:
add edx,ecx
.if edx > eax
sub edx,ecx
jmp sq2
.endif
sub eax,edx
add edx,ecx
sq2:
shr edx,1
shr ecx,1
jc sq3 ;last integer bit of root has been set
shr ecx,1 ;shift under next even position
jmp sq1
sq3:
;at this point, the EDX register would contain the truncated square root result.
;if a rounded result would be preferred, the following code must then be used
; .if eax > edx
; inc edx
; .endif
ret ;sqr answer is returned in the EDX register
Hi raymond
Very interesting :thumbsup:
Thank you.
Regards, Biterider
Hi Raymond!
Quote from: raymond on January 31, 2025, 02:07:46 PMand it could be easily adapted to the 64-bit system by using the three equivalent 64-bit registers, as long as the bsr and bts instructions are still in use
That is correct :thumbsup:
Regards, HSE
Quote from: raymond on January 31, 2025, 02:07:46 PMNot being a mathematician
I always thought you were the
only mathematician in this forum, Raymond :cool:
Thanks, great raymond :thumbsup:
It Shows similarity to a division proc based on how human calculate many digits with pen & paper in school
Héctor 64 bit conversion :thumbsup:
Just thought of a detail which I had not mentioned in my original post. Using the given code, regardless if you modify it or not to get a "rounded" square root, the EAX register would not be altered. Then, upon return, if EAX=0, the result in the EDX register would be the exact square root.
Without the rounding code, the resulting content of the EAX register would be equivalent to the remainder of a division of the target number by the truncated square root.
Hi,
To compute the square root of an integer using SSE instructions in MASM64 assembly, follow these steps:
Load the integer into a general-purpose register (e.g., RAX).
Convert the integer to a double-precision float using CVTSI2SD.
Compute the square root using the SQRTSD instruction.
Result is stored in an XMM register as a double-precision float.
Here's the code example:
.data
num QWORD 25 ; Example integer (64-bit)
.code
main proc
; Load the integer into RAX
mov rax, num
; Convert RAX to double-precision float in XMM0
cvtsi2sd xmm0, rax
; Compute square root of XMM0
sqrtsd xmm0, xmm0
; The result is now in XMM0 as a double-precision float
; Optionally convert back to integer (truncates decimal part)
; cvttsd2si rax, xmm0
ret
main endp
Explanation:
cvtsi2sd xmm0, rax : Converts the 64-bit integer in RAX to a double-precision float in XMM0.
sqrtsd xmm0, xmm0 : Computes the square root of the value in XMM0, storing the result back in XMM0.
Optional Conversion : Use cvttsd2si to convert the result back to an integer if needed, but note this truncates the decimal part.
This approach handles 64-bit integers accurately using double-precision floating-point operations.
For 32-bit integers, the code remains valid as the upper bits are zero-extended. :biggrin:
Hey, since we're now posting totally trivial code examples (yours boils down to really just one instruction: sqrtsd xmm0, xmm0), I'll offer my contribution:
fsqrt
fistp integerSqrt
See how simple that is?
Quote from: ognil on March 06, 2025, 12:13:49 PMCompute the square root using the SQRTSD instruction.
Eleven years ago, in the Lab (https://masm32.com/board/index.php?topic=3125.msg32376#msg32376)
Quote from: NoCforMe on March 06, 2025, 05:16:55 PMyours boils down to really just one instruction: sqrtsd xmm0, xmm0
Right.
QuoteEleven years ago, in the Lab
Тhanks jj, :thumbsup:
I saw MichaelW's post and was immediately disappointed and lost interest in it.
For me, posting code without a detailed explanation of each line and instruction has zero value due to the lack of additional time and desire to study it. The same goes for the macros that are used.
This is another example of bad programming practice that is widely used in this forum. :sad:
Quote from: ognil on March 07, 2025, 02:07:44 AMQuoteEleven years ago, in the Lab
Тhanks jj, :thumbsup:
You completely missed the point here.
Your example was completely trivial. Therefore not helpful.
The discussion was about finding integer square roots
without using the FPU (and therefore without using xmm instructions either).
Seems to have flown right over your head ...
Quote from: NoCforMe on March 07, 2025, 06:02:54 AMYou completely missed the point here.
...
Seems to have flown right over your head ...
It was just another opportunity for him to take shots at the members here:
Quote from: ognil on March 07, 2025, 02:07:44 AMThis is another example of bad programming practice that is widely used in this forum. :sad:
:rolleyes: :rolleyes:
Quote from: zedd151 on March 07, 2025, 06:41:40 AMQuote from: NoCforMe on March 07, 2025, 06:02:54 AMYou completely missed the point here.
...
Seems to have flown right over your head ...
It was just another opportunity for him to take shots at the members here:
Quote from: ognil on March 07, 2025, 02:07:44 AMThis is another example of bad programming practice that is widely used in this forum. :sad:
:rolleyes: :rolleyes:
Plus I'm sure it came from some "AI" source ...
QuoteYou completely missed the point here.
Your example was completely trivial. Therefore not helpful.
The discussion was about finding integer square roots without using the FPU (and therefore without using xmm instructions either).
Seems to have flown right over your head ...
Thank you NoCforMe for controlling your emotions, :badgrin:
To completely calm down, this code is specially created for you. :biggrin:
To compute the integer square root of a 64-bit integer without using SSE or FPU instructions in 64 assembly, we can use a binary search approach.
The algorithm repeatedly narrows down the range of possible square roots by comparing the square of the midpoint with the target number. Here's the step-by-step explanation and the corresponding assembly code:
Approach
Binary Search : The algorithm maintains a range [low, high] and iteratively computes the midpoint mid of this range.
Square Check : For each midpoint mid, compute mid^2 and compare it with the target number n.
Adjust Range : If mid^2 is less than or equal to n, update the lower bound to mid + 1 and keep track of the largest valid mid. If mid^2 exceeds n, adjust the upper bound to mid - 1.
Termination : The loop continues until the range is exhausted, at which point the largest valid mid is the integer square root.
Solution
isqrt:
push rbx ; Save callee-saved register
mov rcx, rdi ; rcx = n (input)
xor rbx, rbx ; rbx = result (initially 0)
mov rdi, 0 ; rdi = low (initially 0)
mov rsi, rcx ; rsi = high (initially n)
loop_start:
; Calculate mid = low + (high - low) / 2
mov rax, rsi ; rax = high
sub rax, rdi ; rax = high - low
shr rax, 1 ; rax = (high - low) / 2
add rax, rdi ; rax = mid
; Compute mid^2 and check against n
mov r8, rax ; r8 = mid
mul rax ; rdx:rax = mid^2
cmp rdx, 0 ; Check if high 64 bits are zero
jne mid_too_big ; If not, mid^2 > n
cmp rax, rcx ; Compare low 64 bits with n
jg mid_too_big ; If mid^2 > n
; Update result and adjust low
mov rbx, r8 ; Update result to mid
add r8, 1 ; mid + 1
mov rdi, r8 ; low = mid + 1
jmp next_iter
mid_too_big:
; Adjust high
sub r8, 1 ; mid - 1
mov rsi, r8 ; high = mid - 1
next_iter:
; Check if low <= high
cmp rdi, rsi
jle loop_start
; Return result
mov rax, rbx ; Move result to rax
pop rbx ; Restore callee-saved register
ret
Explanation
Initialization : The input number n is stored in rcx, and the initial range [low, high] is set to [0, n].
Midpoint Calculation : The midpoint mid is computed as low + (high - low) / 2 to avoid overflow.
Square Check : The square of mid is calculated using the mul instruction, which produces a 128-bit result in rdx:rax. If the high 64 bits (rdx) are non-zero, mid^2 exceeds n. Otherwise, compare the low 64 bits (rax) with n.
Adjust Range : Depending on whether mid^2 is within bounds, adjust low or high and update the result if a valid midpoint is found.
Termination : The loop exits when low exceeds high, and the result is returned in rax.
This approach efficiently narrows down the possible values using binary search, ensuring logarithmic time complexity relative to the input size. :smiley:
Now I am bit insecure whether Ognil is really Lingo, because that code works indeed :cool:
Quote from: jj2007 on March 07, 2025, 08:19:22 AMNow I am bit insecure whether Ognil is really Lingo, because that code works indeed :cool:
include \masm64\include64\masm64rt.inc
.data
.code
start proc
mov rdi, 1000000
call isqrt
invoke ExitProcess, 0
start endp
align 16
isqrt:
push rbx ; Save callee-saved register
mov rcx, rdi ; rcx = n (input) ; input number passed in rdi
xor rbx, rbx ; rbx = result (initially 0)
mov rdi, 0 ; rdi = low (initially 0)
mov rsi, rcx ; rsi = high (initially n)
loop_start:
; Calculate mid = low + (high - low) / 2
mov rax, rsi ; rax = high
sub rax, rdi ; rax = high - low
shr rax, 1 ; rax = (high - low) / 2
add rax, rdi ; rax = mid
; Compute mid^2 and check against n
mov r8, rax ; r8 = mid
mul rax ; rdx:rax = mid^2
cmp rdx, 0 ; Check if high 64 bits are zero
jne mid_too_big ; If not, mid^2 > n
cmp rax, rcx ; Compare low 64 bits with n
jg mid_too_big ; If mid^2 > n
; Update result and adjust low
mov rbx, r8 ; Update result to mid
add r8, 1 ; mid + 1
mov rdi, r8 ; low = mid + 1
jmp next_iter
mid_too_big:
; Adjust high
sub r8, 1 ; mid - 1
mov rsi, r8 ; high = mid - 1
next_iter:
; Check if low <= high
cmp rdi, rsi
jle loop_start
; Return result
mov rax, rbx ; Move result to rax
pop rbx ; Restore callee-saved register
ret
end
:biggrin:
I tested it too.
He forgot to mention that the input number is passed in rdi. I input
1000000, results of course
1000I only slapped this together, ran and and inspected the results in rax in the debugger x64dbg.
If you want an interface or some kind of display, maybe he can provide it. :tongue:
edit: I had the isqrt function within start procedure, I fixed that little issue. :rolleyes:
I was wondering why there was a LEAVE before the return there. :biggrin:
Quote from: jj2007 on March 07, 2025, 08:19:22 AMNow I am bit insecure whether Ognil is really Lingo, because that code works indeed :cool:
For that you can thank "AI", not Lingo/Ognil.
Thanks guys for the compliments, :badgrin:
but I don't think Raymond, who started this post with interesting code (unlike you), will like them very much... :sad:
You know, you really are a total ass, whoever you are.
I'm pretty sure I speak for the congregation on this.
Quote from: ognil on March 07, 2025, 10:51:25 AMThanks guys for the compliments,
but I don't think Raymond, who started this post with interesting code (unlike you), will like them very much... :sad:
And I do not think that MichaelW, who has not even been around here (since November 02, 2018) to defend himself against folks like you, will much like
your comments either.
Quote from: ognil on March 07, 2025, 02:07:44 AMI saw MichaelW's post and was immediately disappointed and lost interest in it.
...
This is another example of bad programming practice that is widely used in this forum. :sad:
Conversely, raymond still frequents the forum almost daily and can
speak for himself.
Apologies to raymond, for the nonsense that has been going on here in this thread by myself and others.
the code posted by ognil makes me think that it's original target was Linux, in Linux rdi is the first integer argument to a function, whereas in Windows it's rcx
also the code fails to preserve the non-volatile registers rdi and rsi
also, raymond's code 2 times faster
Quote from: jack on March 07, 2025, 01:00:30 PMthe code posted by ognil makes me think that it's original target was Linux, in Linux rdi is the first integer argument to a function, whereas in Windows it's rcx
also the code fails to preserve the non-volatile registers rdi and rsi
also, raymond's code 2 times faster
So whaddya expect from "AI"?
@Zedd
Sqrt(1000000) = 1000,that's kinda approximation I want to try with string reducing approximately by halved zeros after first digit
for 2 to sqrt(n)
Quote from: ognil on March 07, 2025, 10:51:25 AMThanks guys for the compliments, :badgrin:
Just for fun, I wrote a testbed to get some timings for your
isqrt algo:
This program was assembled with UAsm64 in 64-bit format.
875 ticks for isqrt, result=1234567
797 ticks for isqrt, result=9876543
31 ticks for fsqrt, result=1234567
32 ticks for fsqrt, result=9876543
This program was assembled with ml64 in 64-bit format.
907 ticks for isqrt, result=1234567
828 ticks for isqrt, result=9876543
15 ticks for fsqrt, result=1234567
32 ticks for fsqrt, result=9876543
I made a simple test in FreeBasic, I replaced labels with numbers otherwise it may give a duplicate symbol error, also replaced rsi with r9 which is volatile
hope you guys won't mind
' compile with fbc64
#cmdline "-asm intel -w all -arch native -gen gcc -v"
Function asmsqrt Naked Cdecl(Byval n As Ulongint) As Ulongint
Asm
Push rbx ' Save callee-saved register
Push rdi ' Save callee-saved register
'mov rcx, rdi ' rcx = n (input)
'n is already in register rcx
Xor rbx, rbx ' rbx = result (initially 0)
mov rdi, 0 ' rdi = low (initially 0)
mov r9, rcx ' r9 = high (initially n)
0:
' Calculate mid = low + (high - low) / 2
mov rax, r9 ' rax = high
Sub rax, rdi ' rax = high - low
Shr rax, 1 ' rax = (high - low) / 2
Add rax, rdi ' rax = mid
' Compute mid^2 and check against n
mov r8, rax ' r8 = mid
mul rax ' rdx:rax = mid^2
cmp rdx, 0 ' Check if high 64 bits are zero
jne 1f ' If not, mid^2 > n
cmp rax, rcx ' Compare low 64 bits with n
jg 1f ' If mid^2 > n
' Update result and adjust low
mov rbx, r8 ' Update result to mid
Add r8, 1 ' mid + 1
mov rdi, r8 ' low = mid + 1
jmp 2f
1:
' Adjust high
Sub r8, 1 ' mid - 1
mov r9, r8 ' high = mid - 1
2:
' Check if low <= high
cmp rdi, r9
jle 0b
' Return result
mov rax, rbx ' Move result to rax
Pop rdi ' Restore callee-saved register
Pop rbx ' Restore callee-saved register
ret
End Asm
End Function
Dim As Ulongint n, root
Dim As Longint sum
Dim As Double t
t=Timer
sum=0
For n=1 To 10000000
root=asmsqrt(n)
'root=fix(sqr(n))
sum+=root
Next
t=timer-t
Print sum
Print t
Print "Press return to end"
Sleep
Is isqrt for CPUs like ARM, that don't have FPU ?
Quote from: TimoVJL on March 07, 2025, 10:36:45 PMIs isqrt for CPUs like ARM, that don't have FPU ?
Maybe arduino and 8 bit cpus that is used in embedded?
Quote from: daydreamer on March 07, 2025, 11:14:53 PMQuote from: TimoVJL on March 07, 2025, 10:36:45 PMIs isqrt for CPUs like ARM, that don't have FPU ?
Maybe arduino and 8 bit cpus that is used in embedded?
Do you even know, what is ARM CPU family ?
QuoteJust for fun, I wrote a testbed to get some timings for your isqrt algo:
Thank you jj for the compliment and recognition of my code for actively participating in your pleasures. :badgrin:
I became a C++ guy from university with two projects implemented in my bank, related to transferring all S.W.I.F.T. daily transactions to the bank's general ledger.
Therefore, it is important for me that the program runs without errors, since the hardware takes care of the speed.
With the availability of modern hardware, optimizing the code has long been a thing of the past and is now just a distant memory for old people. :sad:
Quote from: ognil on March 08, 2025, 03:41:19 AMWith the availability of modern hardware, optimizing the code has long been a thing of the past and is now just a distant memory for old people. :sad:
Hah! What you seem to have missed--what sailed right over your head--is that your "modern hardware" code (using XMM) was a couple orders of magnitude SLOWER than legacy (FPU) code, as demonstrated by JJ's timings above.
Do you even bother reading replies here? One gets the feeling that you're like a "write-only" bot.
One more, using FastMath (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1526)
282 ms for the FPU
608 ms for FastMath
X Y fpu Y FastMath
990.0 31.46427 31.46427
991.0 31.48015 31.48015
992.0 31.49603 31.49603
993.0 31.51190 31.51190
994.0 31.52777 31.52777
995.0 31.54362 31.54362
996.0 31.55947 31.55947
997.0 31.57531 31.57531
998.0 31.59114 31.59114
999.0 31.60696 31.60696
1000.0 31.62278 31.62278
The fpu is pretty fast :biggrin:
Twice as fast as the FastMath version, and a factor 33 faster than Ognil/Lingo's version.
NoCforMe,
I don't know what you're taking to calm your nerves, but maybe they need to increase your dosage because your nervous depression doesn't allow you to differentiate between hardware (something like NVMe SSD) and software (something like XMM). :sad:
JJ's "speed tests" are always manipulated depending on his current emotional state and are practically completely useless and unnecessary to anyone.
I'm not Lingo and I've already said that I don't care about optimizing code speed.
I suggest you stop here and go for a walk in nature for a while.
Quote from: ognil on March 08, 2025, 08:44:54 AMJJ's "speed tests" are always manipulated
Prove it, or I'll call you a liar :biggrin:
Your code is hopelessly slow, but it's a nice intellectual exercise, so I liked it. And I don't care if it's AI or genuine ;-)
Quote from: ognil on March 08, 2025, 08:44:54 AMyour nervous depression doesn't allow you to differentiate between hardware (something like NVMe SSD) and software (something like XMM)
What???That's provably, laughably false: XMM refers to a set of 128-bit
registers (therefore nothing like software. (And technically they fall under the heading of "SSE".)
Your credibility here is slipping fast, man.
NoCforMe,
I agree with you that you, jj and zed are the greatest programmers here and everyone else on the forum (including me) should take your advice as law and bow down to the elegance and speed of your code.
Now please calm down and take a walk. :thumbsup: