my brain isn't working as well as it used to - lol
i am wondering if some of you guys might help :t
i am having difficulty managing the registers for this routine
;############################################################################################
LrgMul64 PROC USES EBX ESI EDI lpuLrgInt:LPVOID,nDwords:DWORD,uMultLo:DWORD,uMultHi:DWORD
;Call With:
;lpuLrgInt = pointer to large unsigned integer
; the buffer must be at least 2 dwords longer than the input value
; the bytes at the end of the integer must be 0 to form a complete dword
;nDwords = number of dwords in integer pointed to by lpuLrgInt
;uMultHi:uMultLo = 64-bit multiplier
;Returns:
;integer pointed to by lpuLrgInt is multiplied by 64-bit value of uMultHi:uMultLo
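xor ecx,ecx ;ECX = carry 0
mov esi,lpuLrgInt ;ESI = source pointer
xor ebx,ebx ;EBX = carry 1
xor edi,edi ;EDI = carry 2
LgMul0: mov eax,[esi]
push eax ;save the dword for the second multiply
mul dword ptr uMultLo ;EDX:EAX = dword * uMultLo
add ecx,eax
adc ebx,edx
pop eax ;POP leaves the carry flag intact
adc edi,0
mov [esi],ecx ;store the finished low dword
mul dword ptr uMultHi ;EDX:EAX = dword * uMultHi
add esi,4 ;changes flags, so it sits before the ADD/ADC chain
add ebx,eax
mov ecx,0 ;MOV keeps the carry flag (XOR would clear it)
adc edi,edx
adc ecx,0 ;ECX = carry out of EDI
dec dword ptr nDwords ;sets ZF for the JNZ below
xchg ebx,edi ;XCHG leaves the flags alone
xchg ecx,edi ;carries rotate: ECX<-EBX, EBX<-EDI, EDI<-carry out
jnz LgMul0
mov [esi],ecx ;write the two extra dwords
mov [esi+4],ebx
ret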
LrgMul64 ENDP
;############################################################################################
EDIT: corrected code for the working routine
the routine does not set any return values, as shown
it could be modified to return the length of the resulting integer
i think i got it :P
i need to do some testing
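For the testing, a small C model of the loop can help when tracing the registers. This is only a sketch: the function name lrg_mul64_model and the variable names c0/c1/c2 are made up here, and it assumes the buffer is little-endian dwords with room for n + 2 of them, as the header comments of LrgMul64 require.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of LrgMul64 (not the posted asm itself).
   Multiplies the n-dword little-endian integer in buf by hi:lo in place.
   buf must have room for n + 2 dwords. */
void lrg_mul64_model(uint32_t *buf, size_t n, uint32_t lo, uint32_t hi)
{
    uint32_t c0 = 0, c1 = 0, c2 = 0;        /* ECX, EBX, EDI in the asm */
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t a = buf[i];
        uint64_t p = a * lo;                /* mul uMultLo -> EDX:EAX */
        uint64_t q = a * hi;                /* mul uMultHi -> EDX:EAX */

        /* add ecx,eax / adc ebx,edx / adc edi,0 */
        uint64_t t = (uint64_t)c0 + (uint32_t)p;
        c0 = (uint32_t)t;
        t = (uint64_t)c1 + (uint32_t)(p >> 32) + (t >> 32);
        c1 = (uint32_t)t;
        c2 += (uint32_t)(t >> 32);
        buf[i] = c0;                        /* mov [esi],ecx */

        /* add ebx,eax / adc edi,edx / adc ecx,0 (after mov ecx,0) */
        t = (uint64_t)c1 + (uint32_t)q;
        c1 = (uint32_t)t;
        t = (uint64_t)c2 + (uint32_t)(q >> 32) + (t >> 32);
        c2 = (uint32_t)t;
        uint32_t carry_out = (uint32_t)(t >> 32);

        /* the two xchg instructions: every carry moves down one dword */
        c0 = c1;
        c1 = c2;
        c2 = carry_out;
    }
    buf[n] = c0;                            /* mov [esi],ecx */
    buf[n + 1] = c1;                        /* mov [esi+4],ebx */
}
```

Calling it on a copy of the same input and comparing dword by dword against the buffer LrgMul64 produced should show the first position where the two disagree.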
Hi Dave,
I am using this proc for my stuff; it may give you some hints.
HIGHDWORD equ 4
i64mul proc multiplicand:SQWORD,multiplier:SQWORD
mov edx, dword ptr multiplicand + HIGHDWORD
mov ecx, dword ptr multiplier + HIGHDWORD
or edx, ecx ; One operand >= 2^32?
mov edx, dword ptr multiplier
mov eax, dword ptr multiplicand
jnz @F ; Yes, need two more multiplies.
mul edx ; multiplicand_lo * multiplier_lo
ret ; Done; plain RET lets MASM emit the epilogue and stack cleanup.
@@: imul edx, dword ptr multiplicand + HIGHDWORD ; p3_lo = multiplicand_hi * multiplier_lo
imul ecx, eax ; p2_lo = multiplier_hi * multiplicand_lo
add ecx, edx ; p2_lo + p3_lo
mul dword ptr multiplier ; p1 = multiplicand_lo * multiplier_lo
add edx, ecx ; p1 + p2_lo + p3_lo = result in EDX:EAX
ret
i64mul endp
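The same split can be written out in C to check the partial products. This is a sketch only, and the function name mul64_from_halves is made up for illustration. It keeps the low 64 bits of the product, which is why only the low dwords of p2 and p3 are needed.

```c
#include <stdint.h>

/* Low 64 bits of a * b built from 32-bit halves, using the same three
   partial products as the asm above (p1 full width, p2 and p3 low dword). */
uint64_t mul64_from_halves(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t p1    = (uint64_t)a_lo * b_lo;   /* mul: full 64-bit product */
    uint32_t p2_lo = b_hi * a_lo;             /* imul: low dword only */
    uint32_t p3_lo = a_hi * b_lo;             /* imul: low dword only */

    /* the high dword wraps modulo 2^32, exactly like add edx,ecx */
    uint32_t hi = (uint32_t)(p1 >> 32) + p2_lo + p3_lo;
    return ((uint64_t)hi << 32) | (uint32_t)p1;
}
```

Because the result is taken modulo 2^64, the same bits come out for signed and unsigned operands.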
It seems Intel has already done this for you (VC++ Express):
C:\Program Files\Microsoft Visual Studio 10.0\VC\crt\src\intel\llmul.asm
thanks guys
64 x 64 isn't too hard
i am doing 64 x 32N :P
it would be easy with 64-bit registers - lol
Hi Dave,
If I can get N x N byte and word multiplies to work, I am sure you
can get N x N double multiplies to work. I may have to revisit mine
to find out why I have a really odd routine in there to make it work.
It has to be a blindness on my part as no one else needs it.
Best of luck,
Steve N.
i am close :lol:
xor ecx,ecx
mov esi,lpuLrgInt
xor ebx,ebx
xor edi,edi
;EDX:EAX = working registers
;ESI = source pointer
;ECX = carry 0
;EBX = carry 1
;EDI = carry 2
loop00: mov eax,[esi]
push eax
mul dword ptr uMultLo
add ecx,eax
adc ebx,edx
pop eax
adc edi,0
mov [esi],ecx
mul dword ptr uMultHi
add ebx,eax
mov ecx,0
adc edi,edx
adc ecx,0
xchg ebx,edi
xchg ecx,edi
add esi,4
dec dword ptr nDwords
jnz loop00
mov [esi],ecx
mov [esi+4],ebx
ret
but no cigar :(
it works !!!!! :eusa_dance:
i was passing the wrong register on the call - lol
i updated the code in the original post :t
Quote from: ToutEnMasm on April 22, 2013, 06:11:02 PM
It seems Intel has already done this for you (VC++ Express):
C:\Program Files\Microsoft Visual Studio 10.0\VC\crt\src\intel\llmul.asm
As an interesting aside, the implementations of the 64-bit arithmetic functions for 32-bit code in Visual C++ are not optimal (at least on today's hardware). I wrote a post about it a couple of years ago if anyone is interested:
http://www.hardtoc.com/archives/154
no matter - i looked at the code for that function
it appears to be for 64x64 mul
i didn't bother timing my routine
it could be done faster with SSE code, i am sure
but, one of my goals in this case is not to use MMX/FPU/XMM registers
Quote from: dedndave on April 23, 2013, 08:43:47 AM
but, one of my goals in this case is not to use MMX/FPU/XMM registers
What's wrong with the FPU? Just curious ;-)
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles
3994 cycles for 100 * div
1971 cycles for 100 * fdiv
3984 cycles for 100 * div
1902 cycles for 100 * fdiv
3981 cycles for 100 * div
1912 cycles for 100 * fdiv
16 bytes for div
16 bytes for fdiv
TestA proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
mov ecx, 123
push -1
push -123
align 4
.Repeat
mov eax, dword ptr [esp]
mov edx, dword ptr [esp+4]
idiv ecx
dec ebx
.Until Sign?
pop eax
pop edx
ret
TestA endp
align 16
TestB proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
mov ecx, 123
push -1
push -123
align 4
.Repeat
fild qword ptr [esp]
fdiv FP8(123.0)
dec ebx
.Until Sign?
pop eax
pop edx
ret
TestB endp
nothing wrong with using FPU or SSE, at all
i am doing a float-to-ascii type routine
my "theory" is, if conversion routines don't use those registers,
you can convert and display values in the middle of your FPU/SSE code, without destroying the contents or state
i don't understand why you are timing the DIV operation
my desired routine performs a MUL operation
if you want to time my function.....
the longest integer in my application is 1194 dwords long
so, the worst case would be that many dwords of FFFFFFFFh, multiplied by FFFFFFFF_FFFFFFFFh
be sure to leave extra room in the buffer (at least 2 more dwords)
see the code in the first post
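For anyone timing that worst case, the expected result has a simple pattern, so the output can be checked without a bignum library. With N dwords of FFFFFFFFh times FFFFFFFF_FFFFFFFFh, the product is (2^(32N) - 1)(2^64 - 1): dword 0 is 1, dword 1 is 0, dwords 2 to N-1 are FFFFFFFFh, dword N is FFFFFFFEh and dword N+1 is FFFFFFFFh (for N >= 2). The sketch below is an assumption-laden illustration, not the posted routine: ref_mul64 and is_worst_case_product are made-up names, and the reference does two plain 32-bit passes rather than the single-pass loop from the first post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_DWORDS = 1194 };   /* longest integer mentioned in the post */

/* Hypothetical out-of-place reference:
   dst[0..n+1] = src[0..n-1] * (hi:lo), little-endian dwords. */
void ref_mul64(uint32_t *dst, const uint32_t *src, size_t n,
               uint32_t lo, uint32_t hi)
{
    const uint32_t m[2] = { lo, hi };
    assert(n > 0);
    for (size_t i = 0; i < n + 2; i++)
        dst[i] = 0;
    for (size_t off = 0; off < 2; off++) {   /* pass 0: lo, pass 1: hi << 32 */
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            /* product + earlier partial sum + carry never exceeds 2^64 - 1 */
            uint64_t t = (uint64_t)src[i] * m[off] + dst[i + off] + carry;
            dst[i + off] = (uint32_t)t;
            carry = t >> 32;
        }
        dst[n + off] = (uint32_t)carry;
    }
}

/* Returns 1 if buf[0..n+1] holds (2^(32n) - 1) * (2^64 - 1); needs n >= 2. */
int is_worst_case_product(const uint32_t *buf, size_t n)
{
    if (buf[0] != 1u || buf[1] != 0u)
        return 0;
    for (size_t i = 2; i < n; i++)
        if (buf[i] != 0xFFFFFFFFu)
            return 0;
    return buf[n] == 0xFFFFFFFEu && buf[n + 1] == 0xFFFFFFFFu;
}
```

Running is_worst_case_product on the buffer after the asm call, with n set to the dword count that was passed in, gives a quick pass/fail for the timing run.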
i am guessing something like 35,000 clock cycles :P
EDIT: i was close :P
i get about 41,000 cycles on my old P4
oh, and that's only part of the function - lol
to generate the exponential probably takes 10,000 cycles
then, multiply it - 41,000 cycles
then, convert to ASCII (11,514 digits) - probably another 40,000 cycles
so, to evaluate the REAL10 value 0001_FFFFFFFF_FFFFFFFF to full precision, about 100,000 cycles
i am sure that's faster than my old version
it shifted the ASCII string right to divide by 2^N
evaluating the same real with that method took about 1 to 2 seconds :biggrin:
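The old version isn't shown in the thread, so the following is only a guess at what one "shift right" step on a decimal string might look like: a single pass that halves an ASCII digit string in place. The name ascii_halve is hypothetical, and a real float-to-ASCII routine would also have to append fraction digits instead of dropping the remainder. Doing this N times over thousands of digits is what makes that approach so slow compared with one binary multiply.

```c
#include <stddef.h>

/* Hypothetical helper: divides the unsigned decimal number stored as ASCII
   digits (most significant first) in digits[0..len-1] by 2, in place.
   Returns the remainder (0 or 1). */
int ascii_halve(char *digits, size_t len)
{
    int rem = 0;
    for (size_t i = 0; i < len; i++) {
        int cur = rem * 10 + (digits[i] - '0');   /* bring down next digit */
        digits[i] = (char)('0' + cur / 2);
        rem = cur % 2;
    }
    return rem;
}
```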
Hi Jibz,
Quote from: Jibz on April 23, 2013, 08:40:09 AM
As an interesting aside, the implementations of the 64-bit arithmetic functions for 32-bit code in Visual C++ are not optimal (at least on today's hardware). I wrote a post about it a couple of years ago if anyone is interested:
http://www.hardtoc.com/archives/154
Very interesting post. I'll check next weekend what gcc is "saying".
Gunther
Quote from: dedndave on April 23, 2013, 07:05:18 PM
i don't understand why you are timing the DIV operation
my desired routine performs a MUL operation
Sorry, I should read your posts more carefully :redface:
no biggy, Jochen
interesting to see the FDIV stuff :P