The MASM Forum
General => The Campus => Topic started by: peter_asm on October 13, 2013, 06:13:08 AM

Hey folks,
This isn't a homework question and I have tried to do this myself but ..just can't get it to work and need some feedback.
I want to perform addition and subtraction on big numbers but I don't get the same results as those in big numbers calculator.
Having looked around the internet, I thought you would add big numbers on x86 cpu using ADD/ADC, like the following function for 256bit numbers.
; edi = esi + edi
bn_add proc uses esi edi
mov ecx, 8
clc
add_cycle:
mov eax, [esi]
adc eax, [edi]
mov [edi], eax
lea esi, [esi+4]
lea edi, [edi+4]
loop add_cycle
ret
bn_add endp
This looked fine to me until i compared the output of addition with big calculator.
It is related to the carry but i'm not entirely sure what.
Does anyone know why this wouldn't be compatible with most big number libraries that perform addition?
By the way, i realize this routine isn't optimized for speed but I really just want to understand the arithmetic...how it can be done, so simple as possible explanation would be welcome.

your are not updating the pointers in ESI and EDI:
add edi,4
add esi,4
inc ecx
jnz ...
an other problem is that you are changing the carry flag at end of each loop iteration sorry, INC/DEC does not affect the carry flag. The best is to unroll the loop:
mov eax,[esi]
add [edi],eax
mov eax,[esi+4]
adc [edi+4],eax
...
EDIT

I updated the code.
When I try adding 2 big numbers from calculator
esi = EABC39EBD635C26C
edi = AA469F19BA22C209
The output is
9502D904 90588475 1
The big calculator gives me:
1 9502D905 90588475
the numbers are mostly correct except for 2nd and order of course.

TEST does set the carry flag to zero. Use DEC instead.

esi = EABC39EBD635C26C
edi = AA469F19BA22C209
i think you mean [esi] and [edi] :P
intel processors use "littleendian" format
that is, the lowerorder bytes are at lower addresses
the code in your original post is almost correct, assuming ESI and EDI are preloaded with addresses
bn_add proc uses esi edi
mov ecx, 8
clc
add_cycle:
mov eax, [esi]
adc eax, [edi]
dec ecx ;sets the ZERO FLAG
mov [edi], eax
lea esi, [esi+4]
lea edi, [edi+4]
jnz add_cycle
ret
bn_add endp
you have another free register, EDX, so you could sort of unroll once, as qWord suggested...
bn_add proc uses esi edi
mov ecx, 4
clc
add_cycle:
mov eax, [esi]
mov edx, [esi+4]
adc eax, [edi]
adc edx, [edi+4]
dec ecx ;sets the ZERO FLAG
mov [edi], eax
mov [edi+4], edx
lea esi, [esi+8]
lea edi, [edi+8]
jnz add_cycle
ret
bn_add endp
with 4 passes, you could even get rid of the loop, altogether :biggrin:

bn_add proc
mov eax, [esi]
mov ecx, [esi+4]
mov edx, [esi+8]
add eax, [edi]
adc ecx, [edi+4]
adc edx, [edi+8]
mov [edi], eax
mov [edi+4], ecx
mov [edi+8], edx
mov eax, [esi+12]
mov edx, [esi+16]
adc eax, [edi+12]
adc edx, [edi+16]
mov [edi+12], eax
mov [edi+16], edx
mov eax, [esi+20]
mov ecx, [esi+24]
mov edx, [esi+28]
adc eax, [edi+20]
adc ecx, [edi+24]
adc edx, [edi+28]
mov [edi+20], eax
mov [edi+24], ecx
mov [edi+28], edx
ret
bn_add endp

maybe faster....
bn_add proc
mov eax, [esi]
mov ecx, [esi+4]
mov edx, [esi+8]
add [edi], eax
adc [edi+4], ecx
adc [edi+8], dx
mov eax, [esi+12]
mov edx, [esi+16]
adc [edi+12], eax
adc [edi+16], edx
mov eax, [esi+20]
mov ecx, [esi+24]
mov edx, [esi+28]
adc [edi+20], eax
adc [edi+24], ecx
adc [edi+28], edx
ret
bn_add endp

i modified the last 2 posts  there is no need to preserve ESI and EDI, because they aren't altered

Hey,
Thanks for all responses!
However, the problem appears to be with the actual numbers I'm using in addition.
They are big endian format or at least should be added to be compatible with big endian machine.
The carry is the problem really, different for big/little endian.
I'll figure it out eventually.

what bigendian machine ??? :redface:
The Intel x86 and x8664 series of processors use the littleendian format, and for this reason, the littleendian format is also known as the Intel convention. Other wellknown littleendian processor architectures are the 6502 (including 65802, 65C816), Z80 (including Z180, eZ80 etc.), MCS48, 8051, DEC Alpha, Altera Nios II, Atmel AVR, VAX, and, largely, PDP11.
so, unless you are writing code for a motorola 6800/68000 processor.....
here is a simple little test piece
i took the last code i posted and made a macro out of it :P
;###############################################################################################
.XCREF
.NoList
INCLUDE \Masm32\Include\Masm32rt.inc
.List
;###############################################################################################
AddOword MACRO
mov eax,[esi]
mov ecx,[esi+4]
mov edx,[esi+8]
add [edi],eax
adc [edi+4],ecx
adc [edi+8],dx
mov eax,[esi+12]
mov edx,[esi+16]
adc [edi+12],eax
adc [edi+16],edx
mov eax,[esi+20]
mov ecx,[esi+24]
mov edx,[esi+28]
adc [edi+20],eax
adc [edi+24],ecx
adc [edi+28],edx
ENDM
;###############################################################################################
.DATA
ALIGN 4
owVal1 dd 9ABCDEF0h,12345678h
owVal2 dd 11111011h,11111111h
;###############################################################################################
.CODE
;***********************************************************************************************
_main PROC
mov esi,offset owVal1
mov edi,offset owVal2
print chr$(' Val1: ')
print uhex$([esi+4])
print uhex$([esi]),13,10
print chr$(' Val2: ')
print uhex$([edi+4])
print uhex$([edi]),13,10
AddOword
print chr$('Total: ')
print uhex$([edi+4])
print uhex$([edi]),13,10
print chr$(13,10)
inkey
exit
_main ENDP
;###############################################################################################
END _main

When I try adding 2 big numbers from calculator
esi = EABC39EBD635C26C
edi = AA469F19BA22C209
The output is
9502D904 90588475 1
The big calculator gives me:
9502D905 90588475 1
you'll notice the first number has extra one added to it.
It's something to do with endianness from what i'm reading
http://stackoverflow.com/questions/13926760/thereasonbehindendianness
Another interesting explanation that I found concerns addition and subtraction. When adding or subtracting multibyte numbers, the least significant byte must be fetched first to see if there is a carryover to more significant bytes. Because the leastsignificant byte is read first in littleendian numbers, the system can parallelize and begin calculation on this byte while fetching the following byte(s).

first, redesign your program for littleendian values
it will make life much simpler when you add other things :t
when you use
dword 12345678h
the bytes are stored as
78 56 34 12
second, start with 2 values whose sum does not exceed the range of an unsigned OWORD
you have chosen 2 values that, when added together, cause a 128bit overflow

I've got the correct values now but I had to add backwards.
mov ecx, 8
clc
add_cycle:
mov eax, [esi+4*ecx4]
adc [edi+4*ecx4], eax
loop add_cycle
Tried to reverse the bytes but it gave same problems.

i reversed the order first and then use bswap in loop which works too.
mov ecx, 8
clc
add_cycle:
lodsd
mov ebx, [edi]
bswap eax
bswap ebx
adc eax, ebx
stosd
loop add_cycle
thanks for your help :)

perhaps you misunderstand what i meant
if you are using BSWAP for all your data, you haven't grasped what i tried to explain
probably my fault, as i haven't talked about this much
intel processors store and manipulate numbers in loworderbytefirst fashion
if you define a DWORD value, then load it into a register, you will see the effect
.DATA
Val01 dd 11223344h
;in memory, Val01 looks like this:
; 44 33 22 11
;(hexadecimal format)
;the loworder bytes are at lower addresses
.CODE
mov eax,Val01 ;AL = 44h, AH = 33h, and so on
it simply must be that way for many instructions to work properly
INC is a good example, because if we use the instruction on Val01...
inc Val01
the loworder byte of Val01 will be incremented
if it rolls over from 0FFh to 0, the second byte will be incremented
if the second byte rolls over from 0FFh to 0, the third byte will be incremented
if the third byte rolls over from 0FFh to 0, the forth byte will be incremented
in order for the instruction to work correctly on a DWORD in memory, the bytes MUST be in littleendian order
to extend from a single DWORD to a longer "string" of DWORD's, we would place the highorder DWORD's at higher addresses
Val01 dd 33221100h,77665544h
;Val01 looks like this in memory
; 00 11 22 33 44 55 66 77
if you like, you may post a complete example of your code,
and i (or we) can show you how to do it using littleendian values, without the use of BSWAP

OKay, I'm using BNCALC to generate the numbers.
http://sourceforge.net/projects/bncalc/

the result is 257 bits in width
so, you have to write your code to handle at least that many bits
when dealing with large values in 32bit code, it is best to use whole DWORD's
to accomodate the result with DWORD's, we need at least 9 DWORD's of width (288 bits)
next, to define the large values in the .DATA section, we reverse the order from what may seem natural :P
; p = EABC39EB D635C26C 6977419B 5101A1FE 4B4E6804 344C5DCF E5E6A8C8 778FE3E9
; q = AA469F19 BA22C209 8A130F9B E52C747B A64ECDB1 799E5564 39531014 24126A6F
; p + q = 1 9502D905 90588475 F38A5137 362E1679 F19D35B5 ADEAB334 1F39B8DC 9BA24E58
p_prime dd 778FE3E9h,0E5E6A8C8h,344C5DCFh,4B4E6804h,5101A1FEh,6977419Bh,0D635C26Ch,0EABC39EBh,0
q_prime dd 24126A6Fh,39531014h,799E5564h,0A64ECDB1h,0E52C747Bh,8A130F9Bh,0BA22C209h,0AA469F19h,0
notice that i have added a 0 DWORD at the highorder end of each value, making them 288 bits, each
now, all we need is some code to display and add
give me a few minutes, and i will post a test program.....

here  try this.....

:)
; p + q = 1 9502D905 90588475 F38A5137 362E1679 F19D35B5 ADEAB334 1F39B8DC 9BA24E58
; p + q = 1 9502D905 90588475 F38A5137 362E1679 F19D35B5 799EB334 1F39B8DC 9BA24E58
I see a typo there with edx but nice one, looks good.
Got confused with the byte order.
Thanks for taking time to help me, well appreciated.
We just have multiplication and division/modulus left...
j/k :)

good catch  corrected the typo
division will be fun :biggrin:
how large is the divisor ?

I found a couple of things to work with but it's late where i am probably won't do anything with it until tomorrow.
For multiply, i have
uint64_t multiply(uint64_t a, uint64_t b) {
uint64_t c = 0;
while (a > 0) {
if (a & 1)
c += b;
a >>= 1;
b <<= 1;
}
return c;
}
This isn't too bad, could use rcr/rcl/add/adc there..with big numbers of course.
Not sure about division but I'm guessing something similar with sub/sbb will do it.
Did see this for 128bit numbers
public static void DivMod (Int128 dividend, Int128 divisor, out Int128 quotient, out Int128 remainder)
{
quotient = dividend;
remainder = 0;
for (int i = 0; i < 128; i++)
{
// Left shift Remainder:Quotient by 1
remainder <<= 1;
if (quotient < 0)
remainder._lo = 1;
quotient <<= 1;
if (remainder >= divisor)
{
remainder = divisor;
quotient++;
}
}
}
It doesn't have to be fast at all and in fact im just trying to grasp basics of how it's done at machine code level.
Picking simple algos... so no montgomery multiplication.

well, i will give you a couple hints
if you multiply a 45bit value by a 29bit value, the result may be as large as (45+29) bits wide
for integer division, the modulus width may be as wide as the divisor, no wider
for integer division of values with large divisors, google "long division algorithm"
if the divisor is 32 bits or less, you can use "multipleprecision division"
there are other ways to go  like multipleprecision FPU code, but that's probably not what you're looking for

couple more hints for long division
write the add and subtract routines first
then write shiftleft and shiftright routines
you can make the division routine with a couple loops and calls :P

Sounds like a good plan!
I'll get back tomorrow with my progress :)