good day all :)
for a long time I had wished that my favorite Basic Compiler would have 128-bit integers with the associated arithmetic and was wondering about an efficient cross-platform implementation, either in FreeBasic or in C
you could use inline assembler but then you would have to deal with each different architecture and operating system, so a high level approach would be easier.
integer 64-bit division is quite slow and was wondering whether using floating-point would be faster, I was thinking of using triple-double floating point as described in the crlibm GitHub repository (https://github.com/taschini/crlibm) and in the pdf (https://ens-lyon.hal.science/ensl-01529804v1/document)
what are your thoughts?
one troublesome detail is the conversion to/from int128 and triple-double
the pdf file doesn't address division but the file triple-double.h has a reciprocal function which looks quite complex
<edit> forgot to mention that it needs to work in 32 and 64-bits
See QuadMath (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1453). Requires libquadmath.a
https://www.freebasic.net/forum/viewtopic.php?t=19208 (https://www.freebasic.net/forum/viewtopic.php?t=19208)
jj, I thought about that but as far as I can tell QuadMath is FP only, I am interested in 128-bit integer only, even if implemented using FP
QuadMath would be able to hold an integer of about 112-bits but with a ton of unwanted luggage
here is a possible solution UInt128-Division-Modulus (https://www.codeproject.com/Tips/785014/UInt128-Division-Modulus)
even though the license is permissive, the ideal license would be public domain
Quote from: jack on March 27, 2025, 02:22:50 AMas far as I can tell QuadMath is FP only
GccInt.pdf (https://gcc.gnu.org/onlinedocs/gcc-10.5.0/gccint.pdf)
4.1 Routines for integer arithmetic
The integer arithmetic routines are used on platforms that don't provide hardware support
for arithmetic operations on some modes.
4.1.1 Arithmetic functions
[Runtime Function]
int __ashlsi3 (int a, int b)
[Runtime Function]
long __ashldi3 (long a, int b)
[Runtime Function]
long long __ashlti3 (long long a, int b)
These functions return the result of shifting a left by b bits.
[Runtime Function]
int __ashrsi3 (int a, int b)
[Runtime Function]
long __ashrdi3 (long a, int b)
[Runtime Function]
long long __ashrti3 (long long a, int b)
These functions return the result of arithmetically shifting a right by b bits.
[Runtime Function]
int __divsi3 (int a, int b)
[Runtime Function]
long __divdi3 (long a, long b)
[Runtime Function]
long long __divti3 (long long a, long long b)
These functions return the quotient of the signed division of a and b.
[Runtime Function]
int __lshrsi3 (int a, int b)
[Runtime Function]
long __lshrdi3 (long a, int b)
[Runtime Function]
long long __lshrti3 (long long a, int b)
These functions return the result of logically shifting a right by b bits.
10
GNU Compiler Collection (GCC) Internals
[Runtime Function]
int __modsi3 (int a, int b)
[Runtime Function]
long __moddi3 (long a, long b)
[Runtime Function]
long long __modti3 (long long a, long long b)
These functions return the remainder of the signed division of a and b.
[Runtime Function]
int __mulsi3 (int a, int b)
[Runtime Function]
long __muldi3 (long a, long b)
[Runtime Function]
long long __multi3 (long long a, long long b)
These functions return the product of a and b.
[Runtime Function]
long __negdi2 (long a)
[Runtime Function]
long long __negti2 (long long a)
These functions return the negation of a.
[Runtime Function]
unsigned int __udivsi3 (unsigned int a, unsigned int b)
[Runtime Function]
unsigned long __udivdi3 (unsigned long a, unsigned long b)
[Runtime Function]
unsigned long long __udivti3 (unsigned long long a,
unsigned long long b)
These functions return the quotient of the unsigned division of a and b.
[Runtime Function]
unsigned long __udivmoddi4 (unsigned long a, unsigned long
b, unsigned long *c)
[Runtime Function]
unsigned long long __udivmodti4 (unsigned long long a,
unsigned long long b, unsigned long long *c)
These functions calculate both the quotient and remainder of the unsigned division
of a and b. The return value is the quotient, and the remainder is placed in variable
pointed to by c.
[Runtime Function]
unsigned int __umodsi3 (unsigned int a, unsigned int b)
[Runtime Function]
unsigned long __umoddi3 (unsigned long a, unsigned long b)
[Runtime Function]
unsigned long long __umodti3 (unsigned long long a,
unsigned long long b)
These functions return the remainder of the unsigned division of a and b.
jj, none of those functions deal with 128-bit integers, so it's irrelevant
Hi Jack
Check this link https://masm32.com/board/index.php?topic=11801.msg128168#msg128168 (https://masm32.com/board/index.php?topic=11801.msg128168#msg128168)
Maybe this can help you.
Regards, Biterider
thanks Biterider :smiley:
I will have a look as soon as I wake up, haven't had my coffee yet ☕
hello Biterider
as a preliminary test, I adapted your uoooDiv procedure to FreeBasic and it runs OK, but a lot more testing needs to be done
I sent you my mod in a PM
dim as uint128 N = (&hFFFFFFFFFFFFFFFF, &h2222222222222222)
dim as uint128 D = (&h0000000000000FF1, &h1111111111111111)
dim as uint128 Q
uoooDiv(Q, N, D)
print128("Dividend (N)", N)
print128("Divisor (D)", D)
print128("Quotient (Q)", Q)
'output
'Dividend (N): 0xFFFFFFFFFFFFFFFF2222222222222222
'Divisor (D): 0x0000000000000FF11111111111111111
'Quotient (Q): 0x000000000000000000100EFCEC0F85F4
I ran that division in a loop of 1000000 and it took 0.031 seconds, the C++ code compiled with -O3 took .08 seconds, that's about 2.58 times slower
the C++ code returns the remainder along with the quotient so that will slow it down some
I managed to adapt Biterider's oword assembler multiply and divide routines to FreeBasic
an unexpected result in timing the high-level multiply vs the asm version, with gcc backend and with optimization level of -O3
multiply n times d 100000000 times in a loop
a sum of the product is kept and printed out
in order to prevent gcc from eliminating the loop
and thereby making the timing meaningless
assembler multiply
n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 113427455641665582386530236262442737152
elapsed time seconds 0.296251199994003
high-level multiply
n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 113427455641665582386530236262442737152
elapsed time seconds 0.0651214999961667
here's the whole enchilada if you want to try run it
the assembler routines are by Biterider (https://masm32.com/board/index.php?topic=11801.msg128168#msg128168)
Quote from: jack on March 29, 2025, 01:37:12 AMhere's the whole enchilada if you want to try run it
Doh! :undecided: My bad...
I expected that there was a Windows executable in the attachment to test. :biggrin:
I wanted to see how my i7 performed on this.
I try to avoid binaries but here it is
sorry, I forgot to zero the sum tor the second test, please re-download
but I suspect that gcc optimized the multiplications somehow, I mean beyond function optimization, am thinking that it probably evaluated the multiplication only once
Quote from: jack on March 29, 2025, 02:39:36 AMI try to avoid binaries but here it is
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 3.60 GHz
multiply n times d 100000000 times in a loop
a sum of the product is kept and printed out
in order to prevent gcc from eliminating the loop
and thereby making the timing meaningless
n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 113427455641665582386530236262442737152
assembler version
elapsed time seconds 0.337129999999888
n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 226854911283331164773060472524885474304
high level version
elapsed time seconds 0.072414799942635
:smiley:
Yes, the C optimisations add a great improvement in performance.
How does the
unoptimsed C version fare compared to the FreeBasic version?
But wait, something is amiss.
'sum of product = 226854911283331164773060472524885474304' - for the high level version in my test?
double what it should be.
and 'sum of product = 113427455641665582386530236262442737152' - for the high level version in your test.
un-optimized times
n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 113427455641665582386530236262442737152
assembler version
elapsed time seconds 0.642031399998814
n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 113427455641665582386530236262442737152
high level version
elapsed time seconds 2.85223180000321
Quote from: zedd151 on March 29, 2025, 02:58:19 AMBut wait, something is amiss.
'sum of product = 226854911283331164773060472524885474304' - for the high level version in my test?
and 'sum of product = 113427455641665582386530236262442737152' - for the high level version in your test.
I had forgotten to zero out the sum for the second test, so it just kept adding, it's fixed now if you want to download the binary
Quote from: jack on March 29, 2025, 03:06:22 AMI had forgotten to zero out the sum for the second test, so it just kept adding, it's fixed now if you want to download the binary
:biggrin:
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 3.60 GHz
multiply n times d 100000000 times in a loop
a sum of the product is kept and printed out
in order to prevent gcc from eliminating the loop
and thereby making the timing meaningless
n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 113427455641665582386530236262442737152
assembler version
elapsed time seconds 0.338004100020044
n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 113427455641665582386530236262442737152
high level version
elapsed time seconds 0.0723988000536337
press return to end
perfect. :azn:
Side note: A humongous difference between unoptimised C version and optimized C version. Biterider might have to tweak his code (or you may need to tweak your conversion) to be a contender. :tongue:
These huge numbers remind me of my experiments with my 'ascii_adder' routine. It added two humongous ascii decimal strings of any length together, outputting an ascii string on return. May be good for working with the U.S. national debt? :biggrin: