Large 64-bit MUL

Started by dedndave, April 22, 2013, 11:14:29 AM

dedndave

my brain isn't working as well as it used to - lol
i am wondering if some of you guys might help   :t

i am having difficulty managing the registers for this routine
;############################################################################################

LrgMul64 PROC USES EBX ESI EDI lpuLrgInt:LPVOID,nDwords:DWORD,uMultLo:DWORD,uMultHi:DWORD

;Call With:
;lpuLrgInt       = pointer to large unsigned integer
;                  the buffer must be at least 2 dwords longer than the input value
;                  the bytes at the end of the integer must be 0 to form a complete dword
;nDwords         = number of dwords in integer pointed to by lpuLrgInt
;uMultHi:uMultLo = 64-bit multiplier

;Returns:
;integer pointed to by lpuLrgInt is multiplied by 64-bit value of uMultHi:uMultLo

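;EDX:EAX = working registers
;ESI     = source pointer
;ECX     = carry 0 (goes into the current dword)
;EBX     = carry 1 (goes into the next dword)
;EDI     = carry 2 (goes into the dword after that)

        xor     ecx,ecx
        mov     esi,lpuLrgInt
        xor     ebx,ebx
        xor     edi,edi

LgMul0: mov     eax,[esi]
        push    eax                     ;save the source dword for the second MUL
        mul     dword ptr uMultLo       ;EDX:EAX = dword * uMultLo
        add     ecx,eax
        adc     ebx,edx
        pop     eax                     ;POP does not alter CF
        adc     edi,0
        mov     [esi],ecx               ;store the finished result dword
        mul     dword ptr uMultHi       ;EDX:EAX = dword * uMultHi
        add     esi,4
        add     ebx,eax
        mov     ecx,0                   ;MOV does not alter CF
        adc     edi,edx
        adc     ecx,0                   ;ECX = carry out of EDI
        dec     dword ptr nDwords       ;sets ZF for the JNZ; XCHG does not alter flags
        xchg    ebx,edi
        xchg    ecx,edi                 ;ECX = old EBX, EBX = old EDI, EDI = new carry
        jnz     LgMul0

        mov     [esi],ecx               ;the two extra dwords at the end of the buffer
        mov     [esi+4],ebx
        ret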

LrgMul64 ENDP

;############################################################################################


EDIT: corrected code for the working routine
the routine does not set any return values, as shown
it could be modified to return the length of the resulting integer
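
For anyone who wants to check the routine against a high-level reference, here is a minimal C sketch of the same loop. The name lrg_mul64_model and the variables c0/c1/c2 are made up for illustration (they stand in for ECX/EBX/EDI); this is a model under those assumptions, not part of the original routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of LrgMul64: multiplies the n-dword little-endian
   integer in a[] in place by the 64-bit value hi:lo.
   a[] must have room for n + 2 dwords, as the asm header requires. */
void lrg_mul64_model(uint32_t *a, size_t n, uint32_t lo, uint32_t hi)
{
    uint32_t c0 = 0, c1 = 0, c2 = 0;    /* play the roles of ECX, EBX, EDI */
    size_t i;

    assert(n > 0);                      /* the asm loop always runs at least once */
    for (i = 0; i < n; i++) {
        uint32_t d = a[i];
        uint64_t p = (uint64_t)d * lo;                        /* mul uMultLo */
        uint64_t s = (uint64_t)c0 + (uint32_t)p;              /* add ecx,eax */
        c0 = (uint32_t)s;
        s = (uint64_t)c1 + (uint32_t)(p >> 32) + (s >> 32);   /* adc ebx,edx */
        c1 = (uint32_t)s;
        c2 += (uint32_t)(s >> 32);                            /* adc edi,0   */
        a[i] = c0;                                            /* mov [esi],ecx */

        p = (uint64_t)d * hi;                                 /* mul uMultHi */
        s = (uint64_t)c1 + (uint32_t)p;                       /* add ebx,eax */
        c1 = (uint32_t)s;
        s = (uint64_t)c2 + (uint32_t)(p >> 32) + (s >> 32);   /* adc edi,edx */
        c2 = (uint32_t)s;

        /* the two XCHGs: shift the carries down one dword position */
        c0 = c1;
        c1 = c2;
        c2 = (uint32_t)(s >> 32);                             /* adc ecx,0   */
    }
    a[n] = c0;                          /* mov [esi],ecx   */
    a[n + 1] = c1;                      /* mov [esi+4],ebx */
}
```

Feeding it two dwords of FFFFFFFFh with a multiplier of FFFFFFFF_FFFFFFFFh should give the dwords 00000001h, 00000000h, FFFFFFFEh, FFFFFFFFh (low to high), which is (2^64-1)^2.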

dedndave

i think i got it   :P
i need to do some testing

Ficko

Hi Dave,

I am using this proc for my own stuff; it may give you a hint.


HIGHDWORD equ 4

i64mul proc multiplicand:SQWORD,multiplier:SQWORD
mov edx, multiplicand + HIGHDWORD
mov ecx, multiplier + HIGHDWORD
or edx, ecx ; One operand >= 2^32?
mov edx, multiplier
mov eax, multiplicand
jnz @F ; Yes, need two multiplies.
mul edx ; multiplicand_lo * multiplier_lo
ret 10h ; Done, return to caller.

@@: imul edx, multiplicand + HIGHDWORD                  ; p3_lo = multiplicand_hi * multiplier_lo
imul ecx, eax ; p2_lo = multiplier_hi * multiplicand_lo
add ecx, edx ; p2_lo + p3_lo
mul dword ptr multiplier                                ; p1 = multiplicand_lo * multiplier_lo
add edx, ecx ; p1 + p2_lo + p3_lo = result in EDX:EAX
ret 10h ; Done, return to caller.
i64mul endp
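
The decomposition described in the comments (one full 32x32 product plus two truncated cross products) can be sketched in C as follows. The function name i64mul_model is hypothetical and the code is only an illustration of the arithmetic, not Ficko's actual build.

```c
#include <stdint.h>

/* Hypothetical C model of the i64mul idea: low 64 bits of a 64x64 product
   built from 32-bit pieces. Works for signed values too, because the low
   64 bits of the product are the same for signed and unsigned operands. */
uint64_t i64mul_model(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint32_t cross;
    uint64_t p1;

    if ((a_hi | b_hi) == 0)                 /* both operands < 2^32 */
        return (uint64_t)a_lo * b_lo;       /* a single MUL is enough */

    /* p2_lo + p3_lo: only the low 32 bits matter, so wrapping is fine */
    cross = a_hi * b_lo + b_hi * a_lo;
    p1 = (uint64_t)a_lo * b_lo;             /* full 64-bit low product */
    return p1 + ((uint64_t)cross << 32);    /* add the cross terms to the high dword */
}
```

The cross products only ever land in the high dword, which is why truncating IMULs are enough for them and only the lo*lo product needs a full MUL.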

ToutEnMasm

It seems Intel has already done it for you here (VC++ Express):
C:\Program Files\Microsoft Visual Studio 10.0\VC\crt\src\intel\llmul.asm

dedndave

thanks guys
64 x 64 isn't too hard
i am doing 64 x 32N   :P
it would be easy with 64-bit registers - lol

FORTRANS

Hi Dave,

   If I can get N x N byte and word multiplies to work, I am sure you
can get N x N double multiplies to work.  I may have to revisit mine
to find out why I have a really odd routine in there to make it work.
It has to be a blindness on my part as no one else needs it.

Best of luck,

Steve N.

dedndave

i am close   :lol:

        xor     ecx,ecx
        mov     esi,lpuLrgInt
        xor     ebx,ebx
        xor     edi,edi

;EDX:EAX = working registers
;ESI = source pointer
;ECX = carry 0
;EBX = carry 1
;EDI = carry 2

loop00: mov     eax,[esi]
        push    eax

        mul     dword ptr uMultLo

        add     ecx,eax
        adc     ebx,edx
        pop     eax
        adc     edi,0
        mov     [esi],ecx

        mul     dword ptr uMultHi

        add     ebx,eax
        mov     ecx,0
        adc     edi,edx
        adc     ecx,0

        xchg    ebx,edi
        xchg    ecx,edi

        add     esi,4
        dec     dword ptr nDwords
        jnz     loop00

        mov     [esi],ecx
        mov     [esi+4],ebx
        ret


but no cigar   :(

dedndave

it works !!!!!   :eusa_dance:

i was passing the wrong register on the call - lol

dedndave

i updated the code in the original post   :t

Jibz

Quote from: ToutEnMasm on April 22, 2013, 06:11:02 PM

It seems Intel has already done it for you here (VC++ Express):
C:\Program Files\Microsoft Visual Studio 10.0\VC\crt\src\intel\llmul.asm

As an interesting aside, the implementation of the 64-bit arithmetic functions for 32-bit code in Visual C++ is not optimal (at least on today's hardware). I wrote a post about it a couple of years ago, if anyone is interested:

http://www.hardtoc.com/archives/154

dedndave

no matter - i looked at the code for that function
it appears to be for 64x64 mul

i didn't bother timing my routine
it could be done faster with SSE code, i am sure
but, one of my goals in this case is not to use MMX/FPU/XMM registers

jj2007

Quote from: dedndave on April 23, 2013, 08:43:47 AM
but, one of my goals in this case is not to use MMX/FPU/XMM registers

What's wrong with the FPU? Just curious ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles

3994    cycles for 100 * div
1971    cycles for 100 * fdiv

3984    cycles for 100 * div
1902    cycles for 100 * fdiv

3981    cycles for 100 * div
1912    cycles for 100 * fdiv

16      bytes for div
16      bytes for fdiv

TestA proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  mov ecx, 123
  push -1
  push -123
  align 4
  .Repeat
   mov eax, dword ptr [esp]
   mov edx, dword ptr [esp+4]
   idiv ecx
   dec ebx
  .Until Sign?
  pop eax
  pop edx
  ret
TestA endp

align 16
TestB proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  mov ecx, 123
  push -1
  push -123
  align 4
  .Repeat
   fild qword ptr [esp]
   fdiv FP8(123.0)
   dec ebx
  .Until Sign?
  pop eax
  pop edx
  ret
TestB endp

dedndave

nothing wrong with using FPU or SSE, at all
i am doing a float-to-ascii type routine
my "theory" is, if conversion routines don't use those registers,
you can convert and display values in the middle of your FPU/SSE code, without destroying the contents or state

i don't understand why you are timing the DIV operation
my desired routine performs a MUL operation

if you want to time my function.....
the longest integer in my application is 1194 dwords long
so, the worst case would be that many dwords of FFFFFFFFh, multiplied by FFFFFFFF_FFFFFFFFh
be sure to leave extra room in the buffer (at least 2 more dwords)
see the code in the first post
i am guessing something like 35,000 clock cycles   :P

EDIT: i was close   :P
i get about 41,000 cycles on my old P4

dedndave

oh, and that's only part of the function - lol

to generate the exponential probably takes 10,000 cycles
then, multiply it - 41,000 cycles
then, convert to ASCII (11,514 digits) - probably another 40,000 cycles

so, to evaluate the REAL10 value 0001_FFFFFFFF_FFFFFFFF to full precision, about 100,000 cycles
i am sure that's faster than my old version
it shifted the ASCII string right to divide by 2^N
evaluating the same real with that method took about 1 to 2 seconds   :biggrin:
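
The 1194-dword and 11,514-digit figures can be reproduced with a small C check. It rests on an assumption the posts do not spell out: that the REAL10 value 0001_FFFFFFFF_FFFFFFFF is treated as mantissa * 2^-16445 = mantissa * 5^16445 / 10^16445, so the "exponential" would be 5^16445. All names below (mul_small, decimal_digits, real10_worst_case) are made up for the sketch, and the 64-bit multiply is done as three 32-bit factors purely to keep the helper short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply the n-dword little-endian integer a[] in place by a 32-bit value.
   Returns the new length in dwords; a[] needs room for one extra dword. */
size_t mul_small(uint32_t *a, size_t n, uint32_t m)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)a[i] * m + carry;
        a[i] = (uint32_t)t;
        carry = t >> 32;
    }
    if (carry != 0)
        a[n++] = (uint32_t)carry;
    return n;
}

/* Count the decimal digits of a nonzero integer by repeatedly dividing
   by 10^9. Destroys the contents of a[]. */
size_t decimal_digits(uint32_t *a, size_t n)
{
    size_t digits = 0;
    assert(n > 0 && a[n - 1] != 0);
    for (;;) {
        uint64_t rem = 0;
        for (size_t i = n; i-- > 0;) {
            uint64_t cur = (rem << 32) | a[i];
            a[i] = (uint32_t)(cur / 1000000000u);
            rem = cur % 1000000000u;
        }
        while (n > 0 && a[n - 1] == 0)
            n--;
        if (n > 0) {                /* more chunks follow: this one is 9 digits */
            digits += 9;
            continue;
        }
        while (rem != 0) {          /* top chunk: count its actual digits */
            digits++;
            rem /= 10;
        }
        return digits;
    }
}

struct sizes { size_t pow_dwords, prod_dwords, prod_digits; };

/* Assumed worst case: 5^16445 times the all-ones 64-bit mantissa. */
struct sizes real10_worst_case(void)
{
    static uint32_t a[1200];
    struct sizes s;
    size_t n = 1;

    for (size_t i = 0; i < 1200; i++)
        a[i] = 0;
    a[0] = 1;
    for (int i = 0; i < 1265; i++)          /* 5^13 per step, 13 * 1265 = 16445 */
        n = mul_small(a, n, 1220703125u);
    s.pow_dwords = n;

    n = mul_small(a, n, 0xFFFFFFFFu);       /* 2^64 - 1 = (2^32 - 1) * 641 * 6700417 */
    n = mul_small(a, n, 641u);
    n = mul_small(a, n, 6700417u);
    s.prod_dwords = n;
    s.prod_digits = decimal_digits(a, n);
    return s;
}
```

Under that assumption the power comes out at 1194 dwords, the product at 1196 dwords (hence the 2 extra dwords in the buffer), and the product has 11,514 decimal digits.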

Gunther

Hi Jibz,

Quote from: Jibz on April 23, 2013, 08:40:09 AM
As an interesting aside, the implementation of the 64-bit arithmetic functions for 32-bit code in Visual C++ is not optimal (at least on today's hardware). I wrote a post about it a couple of years ago, if anyone is interested:

http://www.hardtoc.com/archives/154

Very interesting post. Next weekend I'll check what gcc is "saying".

Gunther