The MASM Forum
General => The Campus => Topic started by: gelatine1 on May 01, 2014, 04:36:56 PM

Which complexity does the mul instruction use ? Is it possible to make something faster than the mul instruction (with a faster algorithm)?

Search The Laboratory for imul lea shl. Check also \Masm32\Help\Opcodes.chm for these instructions.

Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?
And what do they mean with EDX:EAX ?

Hi gelatine1,
Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?
And what do they mean with EDX:EAX ?
a multiplication of 2 32bit numbers leads to a 64bit result. EAX holds bits 031 and EDX holds bits 3263 of the result. There's nothing magic and the Intel or AMD manuals would have tell you that "secret".
Gunther

Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?
Hi,
In the general case, mostly no. A MUL instruction will be the
fastest way to multiply. For special cases you can do better
using other logic. For instance, multiplying by a factor of two
might be faster using SHL 1, shift left by one position. jj2007
mentioned the LEA instruction, that can be used for a fast
multiply by 5 in some cases. However you have to be careful
that the argument being multiplied is not too large.
Until you really need to speed some code up, I would suggest
you use MUL or IMUL and just work on getting the right answer.
Unless you just want to play around getting things done in a
different way of course. That can be fun in itself at times.
Cheers,
Steve N.

MUL is fairly fast (8 to 10 cycles on my machine)
it's DIV that is slow :P

Thanks for the replies, I have a few more questions:
@FORTRANS: How exactly would I use lea to multiply by 5 ? I am just wondering because I wouldn't actually now how it would be done.
@dedndave: Do you mean that I could write some code that runs faster than the div instruction ? Or do you mean that a division is slow in general ?

both are true, to some degree
DIV is slow  varies from machine to machine
but, on mine it runs 60 to 80 cycles or so
many guys have figured out ways to multiply by the inverse of a number
this may be used as a fast division in cases where the divisor is a constant
qWord has written a nice tool for that called Magic Number
LEA.....
lea eax,[4*edx+edx] ;EAX = 5*EDX, provided it doesn't overflow 32 bits
works well on most modern processors
although, not wonderful on the pentium 4's  MUL may be as good, IMUL slightly better

http://www.masmforum.com/archive2012/6717_MagicNumber64.zip (http://www.masmforum.com/archive2012/6717_MagicNumber64.zip)

How exactly would I use lea to multiply by 5 ? I am just wondering because I wouldn't actually now how it would be done.
Hi,
From the Assembly Gems site, in their "LEA a power command"
page, I got the following:
lea ebx,[ebx+ebx*4] ; ebx:=ebx*5
Lots of confusing stuff there.
http://www.df.lth.se/~john_e/fr_gems.html (http://www.df.lth.se/~john_e/fr_gems.html)
Cheers,
Steve N.

Thanks a lot for the replies once again, I'm still wondering why exactly this 'lea trick' runs faster. It seems very clever but why would it only be faster for a multiplication with 5 ? Why wouldnt it be faster for multiplication with any number ?
Ill take a look to the magic number later when i have finished my multiplication (I'm writing a 64bit multiplication)
And thank you for the gem page. I only looked for a minute and i read a lot of interesting stuff already.

It can't be faster for any number because it doesn't work with "any" number.
lea ebx,[ebx+ebx*2] ; ebx:=ebx*3
lea ebx,[ebx+ebx*4] ; ebx:=ebx*5
lea ebx,[ebx+ebx*8] ; ebx:=ebx*9
lea ebx,[ebx+ebx*4] ; ebx:=ebx*10
add ebx, ebx
lea edx,[ebx+ebx*2] ; ebx:=ebx*7
lea ebx,[edx+ebx*4]

Thanks a lot for the replies once again, I'm still wondering why exactly this 'lea trick' runs faster. It seems very clever but why would it only be faster for a multiplication with 5 ? Why wouldnt it be faster for multiplication with any number ?
Ill take a look to the magic number later when i have finished my multiplication (I'm writing a 64bit multiplication)
And thank you for the gem page. I only looked for a minute and i read a lot of interesting stuff already.
The reason it runs faster (LEA ebx,[ebx+ebx*4]) is that the computation of x*4 and x*1 is done in the pipeline before it ever gets to the operation (arithmetic) unit, and the result is then just placed into the register by the OU. A multiplication is not actually done, the x is shifted by 2 bits and just added to x. Note LEA means Load Effective Address and the pipeline is used to calculate addresses for memory accesses, but the LEA instruction makes use of this fact and uses the calculated "address" as an operand to be loaded into a register.
Dave.

There are differences in speed, but you'll have to test them for every CPU. Example:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
260 cycles for 100 * eax*5, lea
281 cycles for 100 * eax*5, imul
285 cycles for 100 * eax*7, lea
277 cycles for 100 * eax*7, imul
409 cycles for 100 * eax*7, mul
260 cycles for 100 * eax*5, lea
281 cycles for 100 * eax*5, imul
285 cycles for 100 * eax*7, lea
281 cycles for 100 * eax*7, imul
409 cycles for 100 * eax*7, mul
261 cycles for 100 * eax*5, lea
281 cycles for 100 * eax*5, imul
285 cycles for 100 * eax*7, lea
281 cycles for 100 * eax*7, imul
409 cycles for 100 * eax*7, mul
5 bytes for eax*5, lea
5 bytes for eax*5, imul
8 bytes for eax*7, lea
5 bytes for eax*7, imul
9 bytes for eax*7, mul

PMMX W98
preP4
424 cycles for 100 * eax*5, lea
1137 cycles for 100 * eax*5, imul
625 cycles for 100 * eax*7, lea
1136 cycles for 100 * eax*7, imul
1139 cycles for 100 * eax*7, mul
424 cycles for 100 * eax*5, lea
1138 cycles for 100 * eax*5, imul
627 cycles for 100 * eax*7, lea
1138 cycles for 100 * eax*7, imul
1141 cycles for 100 * eax*7, mul
429 cycles for 100 * eax*5, lea
1163 cycles for 100 * eax*5, imul
650 cycles for 100 * eax*7, lea
1168 cycles for 100 * eax*7, imul
1157 cycles for 100 * eax*7, mul
5 bytes for eax*5, lea
5 bytes for eax*5, imul
8 bytes for eax*7, lea
5 bytes for eax*7, imul
9 bytes for eax*7, mul
 ok 
PIII W2k
preP4 (SSE1)
312 cycles for 100 * eax*5, lea
312 cycles for 100 * eax*5, imul
312 cycles for 100 * eax*7, lea
312 cycles for 100 * eax*7, imul
418 cycles for 100 * eax*7, mul
312 cycles for 100 * eax*5, lea
312 cycles for 100 * eax*5, imul
312 cycles for 100 * eax*7, lea
311 cycles for 100 * eax*7, imul
420 cycles for 100 * eax*7, mul
312 cycles for 100 * eax*5, lea
312 cycles for 100 * eax*5, imul
312 cycles for 100 * eax*7, lea
312 cycles for 100 * eax*7, imul
419 cycles for 100 * eax*7, mul
5 bytes for eax*5, lea
5 bytes for eax*5, imul
8 bytes for eax*7, lea
5 bytes for eax*7, imul
9 bytes for eax*7, mul
 ok 

Jochen,
the results:
Intel(R) Core(TM) i73770 CPU @ 3.40GHz (SSE4)
182 cycles for 100 * eax*5, lea
194 cycles for 100 * eax*5, imul
216 cycles for 100 * eax*7, lea
452 cycles for 100 * eax*7, imul
607 cycles for 100 * eax*7, mul
184 cycles for 100 * eax*5, lea
179 cycles for 100 * eax*5, imul
215 cycles for 100 * eax*7, lea
194 cycles for 100 * eax*7, imul
253 cycles for 100 * eax*7, mul
185 cycles for 100 * eax*5, lea
188 cycles for 100 * eax*5, imul
222 cycles for 100 * eax*7, lea
458 cycles for 100 * eax*7, imul
612 cycles for 100 * eax*7, mul
5 bytes for eax*5, lea
5 bytes for eax*5, imul
8 bytes for eax*7, lea
5 bytes for eax*7, imul
9 bytes for eax*7, mul
 ok 
Gunther

Intel(R) Core(TM) i52430M CPU @ 2.40GHz (SSE4)
163 cycles for 100 * eax*5, lea
184 cycles for 100 * eax*5, imul
203 cycles for 100 * eax*7, lea
221 cycles for 100 * eax*7, imul
234 cycles for 100 * eax*7, mul
163 cycles for 100 * eax*5, lea
189 cycles for 100 * eax*5, imul
204 cycles for 100 * eax*7, lea
180 cycles for 100 * eax*7, imul
238 cycles for 100 * eax*7, mul
162 cycles for 100 * eax*5, lea
185 cycles for 100 * eax*5, imul
231 cycles for 100 * eax*7, lea
185 cycles for 100 * eax*7, imul
232 cycles for 100 * eax*7, mul
5 bytes for eax*5, lea
5 bytes for eax*5, imul
8 bytes for eax*7, lea
5 bytes for eax*7, imul
9 bytes for eax*7, mul
Why exactly is mul so much slower as imul ? I mean imul has to do more than mul right ? Imul also has to handle the sign ? or am I wrong ?

those timings are for 100 passes
seeing as an instruction can only consume wholenumber cycles......
you're looking at roughly 2 cycles for everyone  lol
they cache a little differently, so have different timings
btw  your i5 is a screamer
the numbers look different on older CPU's

here is my prescott  one of the pentium 4 variations...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
334 cycles for 100 * eax*5, lea
450 cycles for 100 * eax*5, imul
364 cycles for 100 * eax*7, lea
449 cycles for 100 * eax*7, imul
550 cycles for 100 * eax*7, mul
292 cycles for 100 * eax*5, lea
448 cycles for 100 * eax*5, imul
388 cycles for 100 * eax*7, lea
450 cycles for 100 * eax*7, imul
551 cycles for 100 * eax*7, mul
293 cycles for 100 * eax*5, lea
448 cycles for 100 * eax*5, imul
364 cycles for 100 * eax*7, lea
478 cycles for 100 * eax*7, imul
550 cycles for 100 * eax*7, mul

Why exactly is mul so much slower as imul ? I mean imul has to do more than mul right ?
On the contrary, mul has to fill also the edx register.

AMD A83520M APU with Radeon(tm) HD Graphics (SSE3)
498 cycles for 100 * eax*5, lea
498 cycles for 100 * eax*5, imul
538 cycles for 100 * eax*7, lea
498 cycles for 100 * eax*7, imul
660 cycles for 100 * eax*7, mul
452 cycles for 100 * eax*5, lea
453 cycles for 100 * eax*5, imul
489 cycles for 100 * eax*7, lea
453 cycles for 100 * eax*7, imul
598 cycles for 100 * eax*7, mul
453 cycles for 100 * eax*5, lea
453 cycles for 100 * eax*5, imul
489 cycles for 100 * eax*7, lea
453 cycles for 100 * eax*7, imul
599 cycles for 100 * eax*7, mul
5 bytes for eax*5, lea
5 bytes for eax*5, imul
8 bytes for eax*7, lea
5 bytes for eax*7, imul
9 bytes for eax*7, mul
 ok 
Dave.

Why exactly is mul so much slower as imul ? I mean imul has to do more than mul right ?
On the contrary, mul has to fill also the edx register.
Hi,
Both MUL and IMUL fill the EDX register when multiplying EAX.
And they should take about the same time to execute. Jochen
is comparing somewhat different code to try to approximate
real code.
.Repeat
mov eax, 1000
imul eax, eax, 7
dec ebx
.Until Sign?
.Repeat
mov eax, 1000
mov edx, 7
mul edx
dec ebx
.Until Sign?
If the "mov edx, 7" was moved outside the loop the timings
should be closer. But rather less probable in real code.
Another possibility would be defining a variable in memory to
hold a 7, and use that "MUL [Seven]". Still another memory
access, but possibly a bit closer to his IMUL example. One chooses
an example and tests that one.
Regards,
Steve N.

Both MUL and IMUL fill the EDX register when multiplying EAX.
really ? :redface:

If the "mov edx, 7" was moved outside the loop the timings
should be closer.
With a mov ecx, 7 before the loop, timings drop from 411 to 357, but it's still by far the slowest way to multiply.
@dave: not really ;)

Hi,
Oops. It depends on which version of IMUL is used. The
one argument version does fill it. The two and three argument
versions do not. And of course I don't use IMUL often, and
when I did, I used the one argument version. Habit, I guess.
Regards,
Steve N.

It depends on which version of IMUL is used. The one argument version does fill it. The two and three argument versions do not.
Little demo:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
Init
mov edx, 87654321
mov ecx, 200
imul eax, ecx, 1000
imul ecx, 50
deb 4, "three+two args, edx unchanged", eax, ecx, edx
mov eax, 123456789
imul ecx
push edx ; we'll display the edx:eax
push eax ; pair as a qword via xmm0
movlps xmm0, QWORD PTR [esp]
deb 4, "one arg, 123456789*10000", edx, eax, ecx, xmm0
add esp, 2*DWORD
Exit
end start
Output:
three+two args, edx unchanged
eax 200000
ecx 10000
edx 87654321
one arg, 123456789*10000
edx 287
eax 1912276048
ecx 10000
xmm0 1234567890000

Hi Jochen,
Nice demonstration, I may have to try out MasmBasic on an XP
computer here. Amusing that all the numbers were positive.
Looking at IMUL, I wonder if long sequences of multiplies would
be sped up any by using a series of different target registers?
Cheers,
Steve

Hi Jochen,
Nice demonstration, I may have to try out MasmBasic on an XP
computer here. Amusing that all the numbers were positive.
Thanks ;)
With mov ecx, 200:
three+two args, edx unchanged
eax 200000
ecx 10000
edx 87654321
one arg, 123456789*10000
edx 288
eax 1912276048
ecx 10000
xmm0 1234567890000
Looking at IMUL, I wonder if long sequences of multiplies would
be sped up any by using a series of different target registers?
You mean alternating regs? Can you give an example?

Hi,
Would the first block of code be "noticeably" faster than the
second?
MOV EAX,[Mem1]
MOV EBX,[Mem2]
MOV ECX,[Mem3]
MOV EDX,[Mem4]
MOV EDI,[Mem5]
MOV ESI,[Mem6]
IMUL EAX,[Arg1]
IMUL EBX,[Arg2]
IMUL ECX,[Arg3]
IMUL EDX,[Arg4]
IMUL EDI,[Arg5]
IMUL ESI,[Arg6]
MOV [Res1],EAX
MOV [Res2],EBX
MOV [Res3],ECX
MOV [Res4],EDX
MOV [Res5],EDI
MOV [Res6],ESI
MOV EAX,[Mem1]
IMUL EAX,[Arg1]
MOV [Res1],EAX
MOV EAX,[Mem2]
IMUL EAX,[Arg2]
MOV [Res2],EAX
MOV EAX,[Mem3]
IMUL EAX,[Arg3]
MOV [Res3],EAX
MOV EAX,[Mem4]
IMUL EAX,[Arg4]
MOV [Res4],EAX
MOV EAX,[Mem5]
IMUL EAX,[Arg5]
MOV [Res5],EAX
MOV EAX,[Mem6]
IMUL EAX,[Arg6]
MOV [Res6],EAX
Regards,
Steve N.

Would the first block of code be "noticeably" faster than the
second?
"Noticeably" indeed, Steve: one cycle less. It's ten bytes longer, though, so occasionally the cache might make block B the better solution.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
1520 cycles for 100 * block A
1610 cycles for 100 * block B
1513 cycles for 100 * block A
1611 cycles for 100 * block B
1513 cycles for 100 * block A
1614 cycles for 100 * block B
111 bytes for block A
101 bytes for block B

Hi,
Using the median of the timing results (rounded); AMD Athlon 6.5%,
PIII 1.8%, PMMX 7.8%, and Pentium M 1.7%. Cute results, thank
you for the benchmark. Doesn't make the 15% rule in any event.
preP4 (SSE1)
1698 cycles for 100 * block A
1728 cycles for 100 * block B
1711 cycles for 100 * block A
1732 cycles for 100 * block B
1701 cycles for 100 * block A
1733 cycles for 100 * block B
111 bytes for block A
101 bytes for block B
 ok  preP4
6471 cycles for 100 * block A
6974 cycles for 100 * block B
6481 cycles for 100 * block A
6978 cycles for 100 * block B
6472 cycles for 100 * block A
6979 cycles for 100 * block B
111 bytes for block A
101 bytes for block B
 ok  Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
1360 cycles for 100 * block A
1335 cycles for 100 * block B
1358 cycles for 100 * block A
1331 cycles for 100 * block B
1356 cycles for 100 * block A
1337 cycles for 100 * block B
111 bytes for block A
101 bytes for block B
 ok 
Regards,
Steve N.

Jochen,
Intel(R) Core(TM) i73770 CPU @ 3.40GHz (SSE4)
896 cycles for 100 * block A
896 cycles for 100 * block B
895 cycles for 100 * block A
895 cycles for 100 * block B
895 cycles for 100 * block A
894 cycles for 100 * block B
111 bytes for block A
101 bytes for block B
 ok 
Gunther

Thanks  and here my old machine:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1349 cycles for 100 * block A
1329 cycles for 100 * block B
1349 cycles for 100 * block A
1328 cycles for 100 * block B
1352 cycles for 100 * block A
1331 cycles for 100 * block B
111 bytes for block A
101 bytes for block B

Why would someone use the Imul that only stores into eax ? it only allows to multiply two numbers a and b where log(a)+log(b) is smaller than 32 whereas the mul instruction increases that bound to 64 ?

it handles signed integers
it allows immediate operands
it allows multiplication of registers other than EAX (for MUL, one operand must always be in EAX)
it's probably a little faster, depending on the CPU
IMUL is often handy for calculating addresses that LEA won't handle (array index, etc)
you use the one that's most appropriate for what you want to do