128-bit integer arithmetic using FP

jack · March 27, 2025, 12:29:22 AM

good day all :)
for a long time I had wished that my favorite Basic Compiler would have 128-bit integers with the associated arithmetic and was wondering about an efficient cross-platform implementation, either in FreeBasic or in C
you could use inline assembler but then you would have to deal with each different architecture and operating system, so a high level approach would be easier.
integer 64-bit division is quite slow and was wondering whether using floating-point would be faster, I was thinking of using triple-double floating point as described in the crlibm GitHub repository and in the pdf
what are your thoughts?

one troublesome detail is the conversion to/from int128 and triple-double
the pdf file doesn't address division but the file triple-double.h has a reciprocal function which looks quite complex
<edit> forgot to mention that it needs to work in 32 and 64-bits

jj2007 · March 27, 2025, 01:54:53 AM

See QuadMath. Requires libquadmath.a

TimoVJL · March 27, 2025, 02:18:18 AM

https://www.freebasic.net/forum/viewtopic.php?t=19208

jack · March 27, 2025, 02:22:50 AM

jj, I thought about that but as far as I can tell QuadMath is FP only, I am interested in 128-bit integer only, even if implemented using FP
QuadMath would be able to hold an integer of about 112-bits but with a ton of unwanted luggage
here is a possible solution UInt128-Division-Modulus
even though the license is permissive, the ideal license would be public domain

jj2007 · March 27, 2025, 03:56:08 AM

Quote from: jack on March 27, 2025, 02:22:50 AMas far as I can tell QuadMath is FP only

GccInt.pdf

4.1 Routines for integer arithmetic
The integer arithmetic routines are used on platforms that don't provide hardware support
for arithmetic operations on some modes.
4.1.1 Arithmetic functions
[Runtime Function]
int __ashlsi3 (int a, int b)
[Runtime Function]
long __ashldi3 (long a, int b)
[Runtime Function]
long long __ashlti3 (long long a, int b)
These functions return the result of shifting a left by b bits.
[Runtime Function]
int __ashrsi3 (int a, int b)
[Runtime Function]
long __ashrdi3 (long a, int b)
[Runtime Function]
long long __ashrti3 (long long a, int b)
These functions return the result of arithmetically shifting a right by b bits.
[Runtime Function]
int __divsi3 (int a, int b)
[Runtime Function]
long __divdi3 (long a, long b)
[Runtime Function]
long long __divti3 (long long a, long long b)
These functions return the quotient of the signed division of a and b.
[Runtime Function]
int __lshrsi3 (int a, int b)
[Runtime Function]
long __lshrdi3 (long a, int b)
[Runtime Function]
long long __lshrti3 (long long a, int b)
These functions return the result of logically shifting a right by b bits.
10
GNU Compiler Collection (GCC) Internals
[Runtime Function]
int __modsi3 (int a, int b)
[Runtime Function]
long __moddi3 (long a, long b)
[Runtime Function]
long long __modti3 (long long a, long long b)
These functions return the remainder of the signed division of a and b.
[Runtime Function]
int __mulsi3 (int a, int b)
[Runtime Function]
long __muldi3 (long a, long b)
[Runtime Function]
long long __multi3 (long long a, long long b)
These functions return the product of a and b.
[Runtime Function]
long __negdi2 (long a)
[Runtime Function]
long long __negti2 (long long a)
These functions return the negation of a.
[Runtime Function]
unsigned int __udivsi3 (unsigned int a, unsigned int b)
[Runtime Function]
unsigned long __udivdi3 (unsigned long a, unsigned long b)
[Runtime Function]
unsigned long long __udivti3 (unsigned long long a,
unsigned long long b)
These functions return the quotient of the unsigned division of a and b.
[Runtime Function]
unsigned long __udivmoddi4 (unsigned long a, unsigned long
b, unsigned long *c)
[Runtime Function]
unsigned long long __udivmodti4 (unsigned long long a,
unsigned long long b, unsigned long long *c)
These functions calculate both the quotient and remainder of the unsigned division
of a and b. The return value is the quotient, and the remainder is placed in variable
pointed to by c.
[Runtime Function]
unsigned int __umodsi3 (unsigned int a, unsigned int b)
[Runtime Function]
unsigned long __umoddi3 (unsigned long a, unsigned long b)
[Runtime Function]
unsigned long long __umodti3 (unsigned long long a,
unsigned long long b)
These functions return the remainder of the unsigned division of a and b.

jack · March 27, 2025, 05:22:46 AM

jj, none of those functions deal with 128-bit integers, so it's irrelevant

Biterider · March 27, 2025, 10:28:40 AM

Hi Jack
Check this link https://masm32.com/board/index.php?topic=11801.msg128168#msg128168
Maybe this can help you.

Regards, Biterider

jack · March 27, 2025, 07:56:23 PM

thanks Biterider

I will have a look as soon as I wake up, haven't had my coffee yet ☕

jack · March 27, 2025, 11:56:15 PM

hello Biterider
as a preliminary test, I adapted your uoooDiv procedure to FreeBasic and it runs OK, but a lot more testing needs to be done
I sent you my mod in a PM

Code Select

	dim as uint128 N = (&hFFFFFFFFFFFFFFFF, &h2222222222222222)
	dim as uint128 D = (&h0000000000000FF1, &h1111111111111111)
	dim as uint128 Q
	uoooDiv(Q, N, D)
	print128("Dividend (N)", N)
	print128("Divisor (D)", D)
	print128("Quotient (Q)", Q)
'output
'Dividend (N): 0xFFFFFFFFFFFFFFFF2222222222222222
'Divisor (D): 0x0000000000000FF11111111111111111
'Quotient (Q): 0x000000000000000000100EFCEC0F85F4

jack · March 28, 2025, 12:38:16 AM

I ran that division in a loop of 1000000 and it took 0.031 seconds, the C++ code compiled with -O3 took .08 seconds, that's about 2.58 times slower
the C++ code returns the remainder along with the quotient so that will slow it down some

jack · March 29, 2025, 01:37:12 AM

I managed to adapt Biterider's oword assembler multiply and divide routines to FreeBasic
an unexpected result in timing the high-level multiply vs the asm version, with gcc backend and with optimization level of -O3

Code Select

multiply n times d 100000000 times in a loop
a sum of the product is kept and printed out
in order to prevent gcc from eliminating the loop
and thereby making the timing meaningless

assembler multiply
n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 113427455641665582386530236262442737152
elapsed time seconds  0.296251199994003

high-level multiply
n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 113427455641665582386530236262442737152
elapsed time seconds  0.0651214999961667

here's the whole enchilada if you want to try run it
the assembler routines are by Biterider

zedd · March 29, 2025, 01:42:41 AM

Quote from: jack on March 29, 2025, 01:37:12 AMhere's the whole enchilada if you want to try run it

Doh!

My bad...

I expected that there was a Windows executable in the attachment to test.

I wanted to see how my i7 performed on this.

jack · March 29, 2025, 02:39:36 AM

I try to avoid binaries but here it is

sorry, I forgot to zero the sum tor the second test, please re-download

jack · March 29, 2025, 02:43:29 AM

but I suspect that gcc optimized the multiplications somehow, I mean beyond function optimization, am thinking that it probably evaluated the multiplication only once

zedd · March 29, 2025, 02:58:19 AM

Quote from: jack on March 29, 2025, 02:39:36 AMI try to avoid binaries but here it is

Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 3.60 GHz

Code Select

multiply n times d 100000000 times in a loop
a sum of the product is kept and printed out
in order to prevent gcc from eliminating the loop
and thereby making the timing meaningless

n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 113427455641665582386530236262442737152

assembler version
elapsed time seconds  0.337129999999888

n = 340282366920938463447387429234553266722
d = 18446744073709551615
product = 45370982256125128475310893311622766046
sum of product = 226854911283331164773060472524885474304

high level version
elapsed time seconds  0.072414799942635

Yes, the C optimisations add a great improvement in performance.
How does the unoptimsed C version fare compared to the FreeBasic version?

But wait, something is amiss.
'sum of product = 226854911283331164773060472524885474304' - for the high level version in my test?
double what it should be.

and 'sum of product = 113427455641665582386530236262442737152' - for the high level version in your test.

The MASM Forum

News:

128-bit integer arithmetic using FP

jack

jj2007

TimoVJL

jack

jj2007

jack

Biterider

jack

jack

jack

jack

zedd

jack

jack

zedd