The MASM Forum

General => The Campus => Topic started by: gelatine1 on May 01, 2014, 04:36:56 PM

Title: Mul instruction
Post by: gelatine1 on May 01, 2014, 04:36:56 PM
Which complexity does the mul instruction use ?  Is it possible to make something faster than the mul instruction (with a faster algorithm)?
Title: Re: Mul instruction
Post by: jj2007 on May 01, 2014, 05:12:16 PM
Search The Laboratory for imul lea shl. Check also \Masm32\Help\Opcodes.chm for these instructions.
Title: Re: Mul instruction
Post by: gelatine1 on May 01, 2014, 07:29:51 PM
Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?
And what do they mean with EDX:EAX ?
Title: Re: Mul instruction
Post by: Gunther on May 01, 2014, 07:36:45 PM
Hi gelatine1,

Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?
And what do they mean with EDX:EAX ?

a multiplication of 2 32-bit numbers leads to a 64-bit result. EAX holds bits 0-31 and EDX holds bits 32-63 of the result. There's nothing magic and the Intel or AMD manuals would have tell you that "secret".

Gunther
Title: Re: Mul instruction
Post by: FORTRANS on May 01, 2014, 10:47:20 PM
Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?

Hi,

   In the general case, mostly no.  A MUL instruction will be the
fastest way to multiply.  For special cases you can do better
using other logic.  For instance, multiplying by a factor of two
might be faster using SHL 1, shift left by one position.  jj2007
mentioned the LEA instruction, that can be used for a fast
multiply by 5 in some cases.  However you have to be careful
that the argument being multiplied is not too large.

   Until you really need to speed some code up, I would suggest
you use MUL or IMUL and just work on getting the right answer.
Unless you just want to play around getting things done in a
different way of course.  That can be fun in itself at times.

Cheers,

Steve N.
Title: Re: Mul instruction
Post by: dedndave on May 02, 2014, 12:29:42 AM
MUL is fairly fast (8 to 10 cycles on my machine)
it's DIV that is slow   :P
Title: Re: Mul instruction
Post by: gelatine1 on May 02, 2014, 01:02:04 AM
Thanks for the replies, I have a few more questions:
@FORTRANS: How exactly would I use lea to multiply by 5 ? I am just wondering because I wouldn't actually now how it would be done.
@dedndave: Do you mean that I could write some code that runs faster than the div instruction ? Or do you mean that a division is slow in general ?
Title: Re: Mul instruction
Post by: dedndave on May 02, 2014, 01:14:50 AM
both are true, to some degree
DIV is slow - varies from machine to machine
but, on mine it runs 60 to 80 cycles or so

many guys have figured out ways to multiply by the inverse of a number
this may be used as a fast division in cases where the divisor is a constant
qWord has written a nice tool for that called Magic Number

LEA.....
Code: [Select]
    lea     eax,[4*edx+edx]    ;EAX = 5*EDX, provided it doesn't overflow 32 bits
works well on most modern processors
although, not wonderful on the pentium 4's - MUL may be as good, IMUL slightly better
Title: Re: Mul instruction
Post by: dedndave on May 02, 2014, 01:17:58 AM
http://www.masmforum.com/archive2012/6717_MagicNumber64.zip (http://www.masmforum.com/archive2012/6717_MagicNumber64.zip)
Title: Re: Mul instruction
Post by: FORTRANS on May 02, 2014, 01:24:02 AM
How exactly would I use lea to multiply by 5 ? I am just wondering because I wouldn't actually now how it would be done.

Hi,

   From the Assembly Gems site, in their "LEA a power command"
page, I got the following:

Code: [Select]
        lea     ebx,[ebx+ebx*4] ; ebx:=ebx*5
   Lots of confusing stuff there.

http://www.df.lth.se/~john_e/fr_gems.html (http://www.df.lth.se/~john_e/fr_gems.html)

Cheers,

Steve N.
Title: Re: Mul instruction
Post by: gelatine1 on May 02, 2014, 01:29:48 AM
Thanks a lot for the replies once again, I'm still wondering why exactly this 'lea trick' runs faster. It seems very clever but why would it only be faster for a multiplication with 5 ? Why wouldnt it be faster for multiplication with any number ?
Ill take a look to the magic number later when i have finished my multiplication (I'm writing a 64bit multiplication)
And thank you for the gem page. I only looked for a minute and i read a lot of interesting stuff already.
Title: Re: Mul instruction
Post by: jj2007 on May 02, 2014, 01:31:52 AM
It can't be faster for any number because it doesn't work with "any" number.

lea ebx,[ebx+ebx*2] ; ebx:=ebx*3

lea ebx,[ebx+ebx*4] ; ebx:=ebx*5

lea ebx,[ebx+ebx*8] ; ebx:=ebx*9

lea ebx,[ebx+ebx*4] ; ebx:=ebx*10
add ebx, ebx

lea edx,[ebx+ebx*2] ; ebx:=ebx*7
lea ebx,[edx+ebx*4]
Title: Re: Mul instruction
Post by: KeepingRealBusy on May 02, 2014, 01:44:52 AM
Thanks a lot for the replies once again, I'm still wondering why exactly this 'lea trick' runs faster. It seems very clever but why would it only be faster for a multiplication with 5 ? Why wouldnt it be faster for multiplication with any number ?
Ill take a look to the magic number later when i have finished my multiplication (I'm writing a 64bit multiplication)
And thank you for the gem page. I only looked for a minute and i read a lot of interesting stuff already.

The reason it runs faster (LEA ebx,[ebx+ebx*4]) is that the computation of x*4 and x*1 is done in the pipeline before it ever gets to the operation (arithmetic) unit, and the result is then just placed into the register by the OU. A multiplication is not actually done, the x is shifted by 2 bits and just added to x. Note LEA means Load Effective Address and the pipeline is used to calculate addresses for memory accesses, but the LEA instruction makes use of this fact and uses the calculated "address" as an operand to be loaded into a register.

Dave.
Title: Re: Mul instruction
Post by: jj2007 on May 02, 2014, 02:04:03 AM
There are differences in speed, but you'll have to test them for every CPU. Example:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
260     cycles for 100 * eax*5, lea
281     cycles for 100 * eax*5, imul
285     cycles for 100 * eax*7, lea
277     cycles for 100 * eax*7, imul
409     cycles for 100 * eax*7, mul

260     cycles for 100 * eax*5, lea
281     cycles for 100 * eax*5, imul
285     cycles for 100 * eax*7, lea
281     cycles for 100 * eax*7, imul
409     cycles for 100 * eax*7, mul

261     cycles for 100 * eax*5, lea
281     cycles for 100 * eax*5, imul
285     cycles for 100 * eax*7, lea
281     cycles for 100 * eax*7, imul
409     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul
Title: Re: Mul instruction
Post by: FORTRANS on May 02, 2014, 03:46:40 AM
P-MMX W98

pre-P4
424   cycles for 100 * eax*5, lea
1137   cycles for 100 * eax*5, imul
625   cycles for 100 * eax*7, lea
1136   cycles for 100 * eax*7, imul
1139   cycles for 100 * eax*7, mul

424   cycles for 100 * eax*5, lea
1138   cycles for 100 * eax*5, imul
627   cycles for 100 * eax*7, lea
1138   cycles for 100 * eax*7, imul
1141   cycles for 100 * eax*7, mul

429   cycles for 100 * eax*5, lea
1163   cycles for 100 * eax*5, imul
650   cycles for 100 * eax*7, lea
1168   cycles for 100 * eax*7, imul
1157   cycles for 100 * eax*7, mul

5   bytes for eax*5, lea
5   bytes for eax*5, imul
8   bytes for eax*7, lea
5   bytes for eax*7, imul
9   bytes for eax*7, mul


--- ok ---

P-III W2k

pre-P4 (SSE1)

312   cycles for 100 * eax*5, lea
312   cycles for 100 * eax*5, imul
312   cycles for 100 * eax*7, lea
312   cycles for 100 * eax*7, imul
418   cycles for 100 * eax*7, mul

312   cycles for 100 * eax*5, lea
312   cycles for 100 * eax*5, imul
312   cycles for 100 * eax*7, lea
311   cycles for 100 * eax*7, imul
420   cycles for 100 * eax*7, mul

312   cycles for 100 * eax*5, lea
312   cycles for 100 * eax*5, imul
312   cycles for 100 * eax*7, lea
312   cycles for 100 * eax*7, imul
419   cycles for 100 * eax*7, mul

5   bytes for eax*5, lea
5   bytes for eax*5, imul
8   bytes for eax*7, lea
5   bytes for eax*7, imul
9   bytes for eax*7, mul


--- ok ---
Title: Re: Mul instruction
Post by: Gunther on May 02, 2014, 03:47:33 AM
Jochen,

the results:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

182     cycles for 100 * eax*5, lea
194     cycles for 100 * eax*5, imul
216     cycles for 100 * eax*7, lea
452     cycles for 100 * eax*7, imul
607     cycles for 100 * eax*7, mul

184     cycles for 100 * eax*5, lea
179     cycles for 100 * eax*5, imul
215     cycles for 100 * eax*7, lea
194     cycles for 100 * eax*7, imul
253     cycles for 100 * eax*7, mul

185     cycles for 100 * eax*5, lea
188     cycles for 100 * eax*5, imul
222     cycles for 100 * eax*7, lea
458     cycles for 100 * eax*7, imul
612     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

--- ok ---

Gunther
Title: Re: Mul instruction
Post by: gelatine1 on May 02, 2014, 04:13:11 AM
Code: [Select]
Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)

163     cycles for 100 * eax*5, lea
184     cycles for 100 * eax*5, imul
203     cycles for 100 * eax*7, lea
221     cycles for 100 * eax*7, imul
234     cycles for 100 * eax*7, mul

163     cycles for 100 * eax*5, lea
189     cycles for 100 * eax*5, imul
204     cycles for 100 * eax*7, lea
180     cycles for 100 * eax*7, imul
238     cycles for 100 * eax*7, mul

162     cycles for 100 * eax*5, lea
185     cycles for 100 * eax*5, imul
231     cycles for 100 * eax*7, lea
185     cycles for 100 * eax*7, imul
232     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

Why exactly is mul so much slower as imul ? I mean imul has to do more than mul right ? Imul also has to handle the sign ? or am I wrong ?
Title: Re: Mul instruction
Post by: dedndave on May 02, 2014, 04:22:12 AM
those timings are for 100 passes
seeing as an instruction can only consume whole-number cycles......

you're looking at roughly 2 cycles for everyone - lol
they cache a little differently, so have different timings

btw - your i5 is a screamer
the numbers look different on older CPU's
Title: Re: Mul instruction
Post by: dedndave on May 02, 2014, 04:26:32 AM
here is my prescott - one of the pentium 4 variations...

Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

334     cycles for 100 * eax*5, lea
450     cycles for 100 * eax*5, imul
364     cycles for 100 * eax*7, lea
449     cycles for 100 * eax*7, imul
550     cycles for 100 * eax*7, mul

292     cycles for 100 * eax*5, lea
448     cycles for 100 * eax*5, imul
388     cycles for 100 * eax*7, lea
450     cycles for 100 * eax*7, imul
551     cycles for 100 * eax*7, mul

293     cycles for 100 * eax*5, lea
448     cycles for 100 * eax*5, imul
364     cycles for 100 * eax*7, lea
478     cycles for 100 * eax*7, imul
550     cycles for 100 * eax*7, mul
Title: Re: Mul instruction
Post by: jj2007 on May 02, 2014, 04:32:54 AM
Why exactly is mul so much slower as imul ? I mean imul has to do more than mul right ?

On the contrary, mul has to fill also the edx register.
Title: Re: Mul instruction
Post by: KeepingRealBusy on May 02, 2014, 04:42:54 AM
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)

498     cycles for 100 * eax*5, lea
498     cycles for 100 * eax*5, imul
538     cycles for 100 * eax*7, lea
498     cycles for 100 * eax*7, imul
660     cycles for 100 * eax*7, mul

452     cycles for 100 * eax*5, lea
453     cycles for 100 * eax*5, imul
489     cycles for 100 * eax*7, lea
453     cycles for 100 * eax*7, imul
598     cycles for 100 * eax*7, mul

453     cycles for 100 * eax*5, lea
453     cycles for 100 * eax*5, imul
489     cycles for 100 * eax*7, lea
453     cycles for 100 * eax*7, imul
599     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul


--- ok ---

Dave.
Title: Re: Mul instruction
Post by: FORTRANS on May 03, 2014, 12:51:55 AM
Why exactly is mul so much slower as imul ? I mean imul has to do more than mul right ?

On the contrary, mul has to fill also the edx register.

Hi,

   Both MUL and IMUL fill the EDX register when multiplying EAX.
And they should take about the same time to execute.  Jochen
is comparing somewhat different code to try to approximate
real code.

Code: [Select]
  .Repeat
mov eax, 1000
imul eax, eax, 7
dec ebx
  .Until Sign?

  .Repeat
mov eax, 1000
mov edx, 7
mul edx
dec ebx
  .Until Sign?

   If the "mov edx, 7" was moved outside the loop the timings
should be closer.  But rather less probable in real code.

   Another possibility would be defining a variable in memory to
hold a 7, and use that "MUL [Seven]".  Still another memory
access, but possibly a bit closer to his IMUL example.  One chooses
an example and tests that one.

Regards,

Steve N.
Title: Re: Mul instruction
Post by: dedndave on May 03, 2014, 02:45:24 AM
Both MUL and IMUL fill the EDX register when multiplying EAX.

really ?   :redface:
Title: Re: Mul instruction
Post by: jj2007 on May 03, 2014, 03:32:51 AM
   If the "mov edx, 7" was moved outside the loop the timings
should be closer.

With a mov ecx, 7 before the loop, timings drop from 411 to 357, but it's still by far the slowest way to multiply.

@dave: not really ;-)
Title: Re: Mul instruction
Post by: FORTRANS on May 03, 2014, 03:59:34 AM
Hi,

   Oops.  It depends on which version of IMUL is used.  The
one argument version does fill it.  The two and three argument
versions do not.  And of course I don't use IMUL often, and
when I did, I used the one argument version.  Habit, I guess.

Regards,

Steve N.
Title: Re: Mul instruction
Post by: jj2007 on May 03, 2014, 07:33:27 AM
It depends on which version of IMUL is used.  The one argument version does fill it.  The two and three argument versions do not.

Little demo:

include \masm32\MasmBasic\MasmBasic.inc      ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  mov edx, 87654321
  mov ecx, 200
  imul eax, ecx, 1000
  imul ecx, 50
  deb 4, "three+two args, edx unchanged", eax, ecx, edx
  mov eax, 123456789
  imul ecx
  push edx      ; we'll display the edx:eax
  push eax      ; pair as a qword via xmm0
  movlps xmm0, QWORD PTR [esp]
  deb 4, "one arg, 123456789*10000", edx, eax, ecx, xmm0
  add esp, 2*DWORD
  Exit
end start


Output:

three+two args, edx unchanged
eax             200000
ecx             10000
edx             87654321

one arg, 123456789*10000
edx             287
eax             1912276048
ecx             10000
xmm0            1234567890000
Title: Re: Mul instruction
Post by: FORTRANS on May 05, 2014, 06:27:11 AM
Hi Jochen,

   Nice demonstration, I may have to try out MasmBasic on an XP
computer here.  Amusing that all the numbers were positive.

   Looking at IMUL, I wonder if long sequences of multiplies would
be sped up any by using a series of different target registers?

Cheers,

Steve
Title: Re: Mul instruction
Post by: jj2007 on May 05, 2014, 07:41:09 AM
Hi Jochen,

   Nice demonstration, I may have to try out MasmBasic on an XP
computer here.  Amusing that all the numbers were positive.

Thanks ;-)

With mov ecx, -200:

three+two args, edx unchanged
eax             -200000
ecx             -10000
edx             87654321

one arg, 123456789*10000
edx             -288
eax             -1912276048
ecx             -10000
xmm0            -1234567890000


Quote
  Looking at IMUL, I wonder if long sequences of multiplies would
be sped up any by using a series of different target registers?

You mean alternating regs? Can you give an example?
Title: Re: Mul instruction
Post by: FORTRANS on May 05, 2014, 08:15:20 AM
Hi,

   Would the first block of code be "noticeably" faster than the
second?

Code: [Select]
       MOV     EAX,[Mem1]
       MOV     EBX,[Mem2]
       MOV     ECX,[Mem3]
       MOV     EDX,[Mem4]
       MOV     EDI,[Mem5]
       MOV     ESI,[Mem6]
       IMUL    EAX,[Arg1]
       IMUL    EBX,[Arg2]
       IMUL    ECX,[Arg3]
       IMUL    EDX,[Arg4]
       IMUL    EDI,[Arg5]
       IMUL    ESI,[Arg6]
       MOV     [Res1],EAX
       MOV     [Res2],EBX
       MOV     [Res3],ECX
       MOV     [Res4],EDX
       MOV     [Res5],EDI
       MOV     [Res6],ESI

       MOV     EAX,[Mem1]
       IMUL    EAX,[Arg1]
       MOV     [Res1],EAX
       MOV     EAX,[Mem2]
       IMUL    EAX,[Arg2]
       MOV     [Res2],EAX
       MOV     EAX,[Mem3]
       IMUL    EAX,[Arg3]
       MOV     [Res3],EAX
       MOV     EAX,[Mem4]
       IMUL    EAX,[Arg4]
       MOV     [Res4],EAX
       MOV     EAX,[Mem5]
       IMUL    EAX,[Arg5]
       MOV     [Res5],EAX
       MOV     EAX,[Mem6]
       IMUL    EAX,[Arg6]
       MOV     [Res6],EAX

Regards,

Steve N.
Title: Re: Mul instruction
Post by: jj2007 on May 05, 2014, 04:51:30 PM
   Would the first block of code be "noticeably" faster than the
second?

"Noticeably" indeed, Steve: one cycle less. It's ten bytes longer, though, so occasionally the cache might make block B the better solution.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

1520    cycles for 100 * block A
1610    cycles for 100 * block B

1513    cycles for 100 * block A
1611    cycles for 100 * block B

1513    cycles for 100 * block A
1614    cycles for 100 * block B

111     bytes for block A
101     bytes for block B
Title: Re: Mul instruction
Post by: FORTRANS on May 05, 2014, 10:36:50 PM
Hi,

   Using the median of the timing results (rounded); AMD Athlon 6.5%,
P-III 1.8%, P-MMX 7.8%, and Pentium M -1.7%.  Cute results, thank
you for the benchmark.  Doesn't make the 15% rule in any event.


Code: [Select]
pre-P4 (SSE1)

1698 cycles for 100 * block A
1728 cycles for 100 * block B

1711 cycles for 100 * block A
1732 cycles for 100 * block B

1701 cycles for 100 * block A
1733 cycles for 100 * block B

111 bytes for block A
101 bytes for block B


--- ok --- pre-P4
6471 cycles for 100 * block A
6974 cycles for 100 * block B

6481 cycles for 100 * block A
6978 cycles for 100 * block B

6472 cycles for 100 * block A
6979 cycles for 100 * block B

111 bytes for block A
101 bytes for block B


--- ok --- Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

1360 cycles for 100 * block A
1335 cycles for 100 * block B

1358 cycles for 100 * block A
1331 cycles for 100 * block B

1356 cycles for 100 * block A
1337 cycles for 100 * block B

111 bytes for block A
101 bytes for block B


--- ok ---

Regards,

Steve N.
Title: Re: Mul instruction
Post by: Gunther on May 06, 2014, 01:56:58 AM
Jochen,

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

896     cycles for 100 * block A
896     cycles for 100 * block B

895     cycles for 100 * block A
895     cycles for 100 * block B

895     cycles for 100 * block A
894     cycles for 100 * block B

111     bytes for block A
101     bytes for block B

--- ok ---

Gunther
Title: Re: Mul instruction
Post by: jj2007 on May 06, 2014, 03:03:03 AM
Thanks - and here my old machine:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

1349    cycles for 100 * block A
1329    cycles for 100 * block B

1349    cycles for 100 * block A
1328    cycles for 100 * block B

1352    cycles for 100 * block A
1331    cycles for 100 * block B

111     bytes for block A
101     bytes for block B
Title: Re: Mul instruction
Post by: gelatine1 on May 07, 2014, 04:55:04 AM
Why would someone use the Imul that only stores into eax ? it only allows to multiply two numbers a and b where log(a)+log(b) is smaller than 32 whereas the mul instruction increases that bound to 64 ?
Title: Re: Mul instruction
Post by: dedndave on May 07, 2014, 05:03:19 AM
it handles signed integers
it allows immediate operands
it allows multiplication of registers other than EAX (for MUL, one operand must always be in EAX)
it's probably a little faster, depending on the CPU

IMUL is often handy for calculating addresses that LEA won't handle (array index, etc)
you use the one that's most appropriate for what you want to do