Got into a brief discussion recently in another forum about possibilities of assembly vs other languages. Following is one answer I got:
Quoterayfil wrote:
"Show me any other language which could extract the square root of any number with a precision of 1000 decimal digits within a split second."
Following Python snippet can extract the square root with a precision of 10000 decimal digits in 0.01 second.
CODE
from decimal import *
from time import time
def sq(n, pr):
start = time()
getcontext().prec = pr
print(Decimal(n).sqrt()
print(time() - start)
sq(5, 10000)
Mathematica is even faster than Python.
I would be VERY curious to see a finished program which could deliver such a feat, i.e. 0.01 second for 10,000 digits.
Is there anyone
a) sufficiently familiar with Python to understand the offered snippet,
b) capable of generating a working exe to confirm its speed and its output
If proven, its algo would be extremely interesting.
Thanks
hi Raymond
on my PC python takes 0.035 seconds
I have been playing with MP floating point arithmetic for a while in FreeBasic, my algorithms are very simple, the following little program builds an mp float from a 10000 characters long string and and takes the square root, squares it and gets the relative error, the average time per loop is 0.011 seconds
mind you that there's overhead in building the number from string and then squaring it to get the relative error
the same algorithms in C took 0.0054 seconds
dim as decfloat x, y
dim as long i
dim as double t
t=timer
for i=1 to 9
x=trim(str(i))+string(NUMBER_OF_DIGITS-1, trim(str(i))) ' build a 10000 digits number from string
y=sqr(x) 'square root
y=(y*y-x)/x 'calculate the relative error
print fp2str(y, 20)
next
t=timer-t
print t/9
<obviously the decimal floating point package is not included, if you are interested I can upload it and post a link>
QuoteStefan Krah's mpdecimal package (libmpdec): a complete implementation of the General Decimal Arithmetic Specification that will – with minor restrictions – also conform to the IEEE 754-2008 Standard for Floating-Point Arithmetic. Starting from Python-3.3, libmpdec is the basis for Python's decimal module.
mpdec (https://www.bytereef.org/mpdecimal/)
This fixed version run, but don't print nothing :sad: from decimal import *
from time import time
def sq(n, pr):
start = time()
getcontext().prec = pr
print(Decimal(n).sqrt())
print(time() - start)
sq(5, 10000)
With x64 version:
0.18700003623962402
for people on Windows I recommend WinPython (https://winpython.github.io/) the download is huge but includes the most popular libraries like SciPy and many others
Quote from: jack on November 28, 2023, 06:43:20 PMI recommend WinPython (https://winpython.github.io/)
See Using a Masm DLL from Python (https://masm32.com/board/index.php?topic=6356.0) or this (https://masm32.com/board/index.php?msg=45015):
include \masm32\MasmBasic\MasmBasic.inc ; download
includelib \Python34\libs\python34.lib ; adjust path according to your installation
Py_Initialize PROTO C
Py_Finalize PROTO C
PyRun_SimpleString PROTO C :DWORD
Init
call Py_Initialize
fn PyRun_SimpleString, "print('Hello World')"
call Py_Finalize
Exit
end start
That was over 8 years ago, and now intense googling has not produced any python*.lib. They seem to fumble a lot with this language; version 3.8, for example, is the last one that works on Windows 7: "Last WinPython version that is said to still work on Windows 7 should be WinPython64-3.8.9.0"
In the phone program is a little slow:
0.221851
:biggrin:
python 3.9(64-bit)
Quote0.04675626754760742
Quote from: jj2007 on November 28, 2023, 08:34:51 PMThat was over 8 years ago, and now intense googling has not produced any python*.lib.
You may try to download Windows Installer from the official site https://www.python.org/downloads/ (https://www.python.org/downloads/)
python*.lib resides in libs folder
Quote from: GoneFishing on November 29, 2023, 12:43:24 AMdownload Windows Installer from the official site https://www.python.org/downloads/
python*.lib resides in libs folder
I've done that in the meantime, using this archive (https://www.python.org/ftp/python/3.8.9/python-3.8.9-embed-win32.zip). There's a problem, though:
include \masm32\include\masm32rt.inc
.data
hDll dd ?
hPyList_New dd ?
.code
start:
mov hDll, rv(LoadLibrary, "\Python\python3.dll")
print hex$(eax), 9, "LoadLib", 13, 10
mov hPyList_New, rv(GetProcAddress, hDll, "PyList_New")
print hex$(rv(GetLastError)), 9, "last error", 13, 10
print hex$(hPyList_New), 9, "PyList_New", 13, 10
exit
end start
Output:
6DD50000 LoadLib
000000C1 last error
00000000 PyList_New
Googling
0xc1 loadlibrary python yields thousands of hits. Error 0xC1 = dec 193 means
bad exe format. So it
loads the library correctly as 32-bit code but fails to get the address of PyList_New, which is according to Timo's TLPEView, part of python3.dll :cool:
You can solve the problem with
invoke SetCurrentDirectory, chr$("\Python\")
Apparently, python3.dll loads libraries not from its own folder but rather 1. from the executable's folder and then 2. from Windows\System32, where it finds 64-bit DLLs. It's probably a feature :cool:
That ftp link was useful :thumbsup:
I had to download 32-bit version,then I copied python*.dlls to my test project folder
and with a minor adjustment:
mov hDll, rv(LoadLibrary, "python3.dll")
all worked fine:
Quote73EE0000 LoadLib
00000000 last error
715A3A30 PyList_New
Quote from: GoneFishing on November 29, 2023, 02:53:00 AMI copied python*.dlls to my test project folder
And all worked fine, sure. But you would have to do that for all your projects, thus wasting an awful lot of disk space.
Next version of MasmBasic will have this macro, so that you can access the DLL from any folder:
SetDllFolder MACRO arg:=<0>
ifndef sdfOld$
.DATA?
sdfOld$ db 260 dup(?) ; MAX_PATH
.CODE
endif
ifdif <arg>, <0>
.if 1
push repargA(arg)
invoke GetCurrentDirectory, 260, addr sdfOld$
call SetCurrentDirectory
else
invoke SetCurrentDirectory, addr sdfOld$
.endif
endif
ENDM
Usage:
SetDllFolder "\Python"
mov hDll, rv(LoadLibrary, "Python3")
mov func1, rv(GetProcAddress, hDll, "PyList_New")
...
SetDllFolder
I prefer pure MASM32.
And you're right python wants to know the path to its folder which contains not only dlls but also python*.zip with all needed modules.
Your MasmBasic HelloWorld proggie outputs a lot of internal configuration info when it doesn't find any modules:
QuotePython path configuration:
PYTHONHOME = (not set)
PYTHONPATH = (not set)
program name = 'python'
isolated = 0
environment = 1
user site = 1
import site = 1
sys._base_executable = 'C:\\masm32\\projects\\python\\39x32\\t2.exe'
sys.base_prefix = ''
sys.base_exec_prefix = ''
sys.platlibdir = 'lib'
sys.executable = 'C:\\masm32\\projects\\python\\39x32\\t2.exe'
sys.prefix = ''
sys.exec_prefix = ''
sys.path = [
'C:\\masm32\\projects\\python\\39x32\\python39.zip',
'.\\DLLs',
'.\\lib',
'C:\\masm32\\projects\\python\\39x32',
]
Quote from: jack on November 28, 2023, 12:55:11 PMhi Raymond
on my PC python takes 0.035 seconds
Many thanks to all those having shown interest in this subject.
Jack indicates that the quoted 0.01 second may be quite possibly achievable.
My main interest was to find out if there is a significantly different way of extracting a square root compared to all those I have learned in the past.
The most promising one for speed would have to be using logarithms. But then, are there means to convert those rapidly back and forth with sufficient precision? Or is it something totally different? A self-contained working program would certainly be useful to run under ollydbg and get some of the details.
my algorithm uses log and exp evaluated using double precision as a first approximation, thereafter I use the Newton-Raphson method, something like this
x0 = x scaled so it's between 1 and 10 but less than 10
ex = the exponent in base 10 of x
y0 = first approximation
y0 = exp(log(x0)/2)*exp(2.302585092994046*ex/2) ; the second exp is mp-exp evaluated only to 16 digits
prec=32
then do the Newton-Raphson calculations in a loop doubling the precision each time in the loop
the calculations inside the loop are only evaluated to prec precision
Paul Dixon posted an implementation of the square root in PowerBasic with inline asm https://forum.powerbasic.com/forum/user-to-user-discussions/programming/816462-arbitrary-length-number-math?p=816479#post816479 (https://forum.powerbasic.com/forum/user-to-user-discussions/programming/816462-arbitrary-length-number-math?p=816479#post816479)
but the Newton-Raphson method is several orders of magnitude faster than his approach
just in case that someone would like to have a look at my humble implementation of multiple precision decimal floating-point I attach the code in FreeBasic DecFloat-FB (https://u.pcloud.link/publink/show?code=XZLlhy0ZqQ0UKHkOkTQ9kVAIgwqebmjFm5y0)
hi Raymond,
QuoteA self-contained working program would certainly be useful
In theory the small python script in your post can be converted to exe format.
I've used Pyinstaller for that purpose but the result doesn't look "debuggable" because of its size :
exe = 1018KB + dependency folder = 10.2MB
BTW performance improved to 0.031305789947509766
Quote from: GoneFishing on November 29, 2023, 04:43:20 AMI prefer pure MASM32.
The SetDllFolder macro
is pure Masm32. You are working with MASM, where the first M stands for "MACRO". Get used to it.
Has anybody succeeded in running Ray's code?
from decimal import *
from time import time
def sq(n, pr):
start = time()
getcontext().prec = pr
print(Decimal(n).sqrt()
print(time() - start)
sq(5, 10000)
C:\Python>python.exe jjTest.py
File "jjTest.py", line 8
print(time() - start)
^
SyntaxError: invalid syntax
This is
after eliminating the
unexpected indent crap.
Edit: I found the reason, and will never seriously touch a language that has such horrible error handling.
Quote from: jj2007 on November 29, 2023, 07:27:04 AMHas anybody succeeded in running Ray's code?
As you could guess I've succeeded:
Quoteprint(Decimal(n).sqrt())
Unindent last line:
from decimal import *
from time import time
def sq(n, pr):
start = time()
getcontext().prec = pr
print(Decimal(n).sqrt())
print(time() - start)
sq(5, 10000)
just one last post of mine, here's a 64-bit exe plus source for testing
<the exe should work as is but I discovered a bug in my fp to string conversion that I need to track down>
<edit; got rid of non-working code and compiled without -march=native, should work on older CPU>
Quote from: jack on November 29, 2023, 07:54:22 AMhere's a 64-bit exe
Well done, it compiles "out of the box" and it's a factor 5 faster than the Python version :thumbsup:
Quote from: jack on November 29, 2023, 07:54:22 AMjust one last post of mine, here's a 64-bit exe plus source for testing
<the exe should work as is but I discovered a bug in my fp to string conversion that I need to track down>
Crash in Windows 7 :sad:
Just a problem with an old AMD
00000001`3f520a51 c5f829b424d0000000 vmovaps xmmword ptr [rsp+0D0h],xmm6 ss:00000000`0052fca0=00000000000000000000000000000000
thanks for testing jj and TimoVJL
@TimoVJL, would try again?
I recompiled without the -march=native so it should work on older CPU's but probably not as fast
Same problem00000001`3fe4f891 c5f829b424d0000000 vmovaps xmmword ptr [rsp+0D0h],xmm6 ss:00000000`0055f7f0=00000000000000000000000000000000
TimoVJL, one last try please, I recompiled on a Windows 7 VM
time per loop = 0.0642072000756571 :thumbsup:
hi TimoVJL, looks like I forgot to change the time message but at least it runs
how does the time compare with python?
from post #2
With x64 version:
0.18700003623962402
Attached a list of the 1600 functions exported by Python38.dll (direct link (https://www.python.org/ftp/python/3.8.9/python-3.8.9-embed-win32.zip))
Note you get a similar list from Python3.dll, which grabs about 800 functions from *38.dll; they probably keep it for backwards compatibility.
I noticed strange behaviour of PySys_WriteStdout: it writes 0D0D0Ah sequences, see Assembly uses Python (https://masm32.com/board/index.php?topic=11506.0) (warning, that's MasmBasic :biggrin: )
PySys_WriteStdout(cfm$("%i\t%s\n"), ecx, PyList_GetItem(edi, ecx))
I'm not sure whether it's Python's fault, or if the console plays tricks.
Quote from: GoneFishing on November 29, 2023, 06:41:37 AMhi Raymond,
QuoteA self-contained working program would certainly be useful
In theory the small python script in your post can be converted to exe format.
I've used Pyinstaller for that purpose but the result doesn't look "debuggable" because of its size :
exe = 1018KB + dependency folder = 10.2MB
WOW Talk about BLOATING to simply extract a square root!!!! :skrewy: :dazzled: :eusa_dance: :eusa_boohoo:
Quote from: jack on November 29, 2023, 07:54:22 AMjust one last post of mine, here's a 64-bit exe plus source for testing
<the exe should work as is but I discovered a bug in my fp to string conversion that I need to track down>
<edit; got rid of non-working code and compiled without -march=native, should work on older CPU>
D/L, unzipped & ran without problem (except for one minor detail: it "prints" some extra 100 or so blank pages).
0.022 sec on my old 10-year old laptop under Win10. Almost 100 times faster than my BCD algo at 1.5 second. All 10,000 digits of the square root of 5 seem to be the same with both programs.
Is there something similar as the 32-bit Ollydbg to study this 64-bit program?
Quote from: raymond on November 30, 2023, 08:14:34 AMIs there something similar as the 32-bit Ollydbg to study this 64-bit program?
x64Dbg (https://x64dbg.com/) has the typical Olly interface, and it works fine.
raymond
what algorithm did you use for the square root?
Quote from: raymond on November 29, 2023, 04:56:32 AMMy main interest was to find out if there is a significantly different way of extracting a square root compared to all those I have learned in the past.
The most promising one for speed would have to be using logarithms. But then, are there means to convert those rapidly back and forth with sufficient precision? Or is it something totally different? A self-contained working program would certainly be useful to run under ollydbg and get some of the details.
I don't know if Taylor series has sufficient precision for log and exp functions?
I have SIMD trigo functions that calculate 4*sincos fast
y0 = exp(log(x0)/2)*exp(2.302585092994046*ex/2) ;possibility for SIMT or SIMD calculates *0.5 and exp functions in parallel?
Quotey0 = exp(log(x0)/2)*exp(2.302585092994046*ex/2)
I wrote it that way at the time because I was playing with nth roots, but for the square root it's simpler to use sqrt(x0)*exp(2.302585092994046*ex/2), the reason for splitting the number is so that you can deal with huge numbers beyond the range of double or even extended
so for example if x=17, then x0=1.7 and ex=1
but suppose x=5e10000 then x0=5 and ex=10000 so exp(2.302585092994046*ex/2) = 1.00000000000158e5000 so you need to use MPexp to evaluate that expression
I tried the inverse square root Newton-Rapson method but with my very simple multiplication algorithm it was slower.
Quote from: jack on November 30, 2023, 10:53:01 AMraymond
what algorithm did you use for the square root?
When I first learned about this algo in the early 1960s, I dubbed it the "abacus" algo because it seemed as if it could have been developed in ancient times when the abacus was the prime instrument to perform calculations.
In those days, IBM was marketing mechanical calculators to perform primarily the usual math functions: add, subtract, multiply, divide. The user's manual also described a procedure to extract square roots without using any mult/div operations except for one single preliminary multiplication. That is the procedure which I adapted in the late 1980's for a TRS80 to achieve 10,000 decimal precision of square roots with only 64kb total memory (using what was named at the time as "machine language" :skrewy: )!
Here's a brief description of that algo.
Let's use the square root of 2 for example
;1st step is to multiply the source by 5
2.0000....*5 = 10.0000.....
;2nd step subdivide the result into chunks of 2 digits from the decimal delimiter, pad front with a 0 if necessary
10 00 00 00 ...........
;start by subtracting 05 from the first double digits
05
__
05 00 00 00 ......
;continue incrementing the '05' by 10 for the subsequent subtractions until the result would be negative
15
__
;would be negative for this example
;remove the terminating 5 and replace it with a 05, then shift by 1 digit right and repeat subtractions
05 00 00 00 ..........
01 05
______________
03 95 00 00 ....
01 15
______________
02 80 00 00 ....
01 25
______________
01 55 00 00 .........
01 35
______________
00 20 00 00 ......
01 45
;would become negative for this next digit
;remove the terminating 5 and replace it with a 05, then shift by 1 digit right and repeat subtractions
00 20 00 00 ...........
00 14 05
_____________
00 05 95 00 ..........
00 14 15
;would become negative for this next digit
;remove the terminating 5 and replace it with a 05, then shift by 1 digit right and repeat subtractions
00 05 95 00 ..........
00 01 41 05
_____________
00 04 53 95 ..........
00 01 41 15
_____________
00 03 12 80 ..........
00 01 41 25
_____________
00 01 71 55 ..........
00 01 41 35
_____________
00 00 30 20
00 01 41 45
;would become negative for this next digit
;remove the terminating 5 and replace it with a 05, then shift by 1 digit right and repeat subtractions
00 00 30 20 00 00 ........
00 00 14 14 05
________________
00 00 16 05 95 00 .......
00 00 14 14 15
________________
00 00 01 91 80 00 .........
00 00 14 14 25
;would become negative for this next digit
;remove the terminating 5 and replace it with a 05, then shift by 1 digit right and repeat subtractions
00 00 01 91 80 00 .........
00 00 01 41 42 05
;and you can continue to repeat this ad infinitum......... limited to the amount of available memory.
You also have to keep track of where the decimal delimiter would be in the result.
This somehow makes me feel old. :dazzled:
:biggrin:
interesting algorithm
SSE rcpsqrt low-res LUT I use in make circles very fast in ddraw, documentation suggest use newton- raphson to perform better precision
my dos. Com file is inspired by pixelshader way of make circles per pixel checking every pixel if inside or outside circle radius r= sqrt(x*x+Y*y)
Fastest I use in dos. Com is reduce the slow fsqrt every pixel vs 100 radius, with
R=x*x+Y*y
.If R < 10000 ; check vs 100*100
Mov color,circle color
.else
Mov color,background color
.endif