Mul instruction

gelatine1 · May 01, 2014, 04:36:56 PM

Which complexity does the mul instruction use ? Is it possible to make something faster than the mul instruction (with a faster algorithm)?

jj2007 · May 01, 2014, 05:12:16 PM

Search The Laboratory for imul lea shl. Check also \Masm32\Help\Opcodes.chm for these instructions.

gelatine1 · May 01, 2014, 07:29:51 PM

Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?
And what do they mean with EDX:EAX ?

Gunther · May 01, 2014, 07:36:45 PM

Hi gelatine1,

Quote from: gelatine1 on May 01, 2014, 07:29:51 PM
Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?
And what do they mean with EDX:EAX ?

a multiplication of 2 32-bit numbers leads to a 64-bit result. EAX holds bits 0-31 and EDX holds bits 32-63 of the result. There's nothing magic and the Intel or AMD manuals would have tell you that "secret".

Gunther

FORTRANS · May 01, 2014, 10:47:20 PM

Quote from: gelatine1 on May 01, 2014, 07:29:51 PM
Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?

Hi,

In the general case, mostly no. A MUL instruction will be the
fastest way to multiply. For special cases you can do better
using other logic. For instance, multiplying by a factor of two
might be faster using SHL 1, shift left by one position. jj2007
mentioned the LEA instruction, that can be used for a fast
multiply by 5 in some cases. However you have to be careful
that the argument being multiplied is not too large.

Until you really need to speed some code up, I would suggest
you use MUL or IMUL and just work on getting the right answer.
Unless you just want to play around getting things done in a
different way of course. That can be fun in itself at times.

Cheers,

Steve N.

dedndave · May 02, 2014, 12:29:42 AM

MUL is fairly fast (8 to 10 cycles on my machine)
it's DIV that is slow :P

gelatine1 · May 02, 2014, 01:02:04 AM

Thanks for the replies, I have a few more questions:
@FORTRANS: How exactly would I use lea to multiply by 5 ? I am just wondering because I wouldn't actually now how it would be done.
@dedndave: Do you mean that I could write some code that runs faster than the div instruction ? Or do you mean that a division is slow in general ?

dedndave · May 02, 2014, 01:14:50 AM

both are true, to some degree
DIV is slow - varies from machine to machine
but, on mine it runs 60 to 80 cycles or so

many guys have figured out ways to multiply by the inverse of a number
this may be used as a fast division in cases where the divisor is a constant
qWord has written a nice tool for that called Magic Number

LEA.....

Code Select

lea eax,[4*edx+edx] ;EAX = 5*EDX, provided it doesn't overflow 32 bits

works well on most modern processors
although, not wonderful on the pentium 4's - MUL may be as good, IMUL slightly better

dedndave · May 02, 2014, 01:17:58 AM

http://www.masmforum.com/archive2012/6717_MagicNumber64.zip

FORTRANS · May 02, 2014, 01:24:02 AM

Quote from: gelatine1 on May 02, 2014, 01:02:04 AM
How exactly would I use lea to multiply by 5 ? I am just wondering because I wouldn't actually now how it would be done.

Hi,

From the Assembly Gems site, in their "LEA a power command"
page, I got the following:

Code Select

lea ebx,[ebx+ebx*4] ; ebx:=ebx*5

Lots of confusing stuff there.

http://www.df.lth.se/~john_e/fr_gems.html

Cheers,

Steve N.

gelatine1 · May 02, 2014, 01:29:48 AM

Thanks a lot for the replies once again, I'm still wondering why exactly this 'lea trick' runs faster. It seems very clever but why would it only be faster for a multiplication with 5 ? Why wouldnt it be faster for multiplication with any number ?
Ill take a look to the magic number later when i have finished my multiplication (I'm writing a 64bit multiplication)
And thank you for the gem page. I only looked for a minute and i read a lot of interesting stuff already.

jj2007 · May 02, 2014, 01:31:52 AM

It can't be faster for any number because it doesn't work with "any" number.

lea ebx,[ebx+ebx*2] ; ebx:=ebx*3

lea ebx,[ebx+ebx*4] ; ebx:=ebx*5

lea ebx,[ebx+ebx*8] ; ebx:=ebx*9

lea ebx,[ebx+ebx*4] ; ebx:=ebx*10
add ebx, ebx

lea edx,[ebx+ebx*2] ; ebx:=ebx*7
lea ebx,[edx+ebx*4]

KeepingRealBusy · May 02, 2014, 01:44:52 AM

Quote from: gelatine1 on May 02, 2014, 01:29:48 AM
Thanks a lot for the replies once again, I'm still wondering why exactly this 'lea trick' runs faster. It seems very clever but why would it only be faster for a multiplication with 5 ? Why wouldnt it be faster for multiplication with any number ?
Ill take a look to the magic number later when i have finished my multiplication (I'm writing a 64bit multiplication)
And thank you for the gem page. I only looked for a minute and i read a lot of interesting stuff already.

The reason it runs faster (LEA ebx,[ebx+ebx*4]) is that the computation of x*4 and x*1 is done in the pipeline before it ever gets to the operation (arithmetic) unit, and the result is then just placed into the register by the OU. A multiplication is not actually done, the x is shifted by 2 bits and just added to x. Note LEA means Load Effective Address and the pipeline is used to calculate addresses for memory accesses, but the LEA instruction makes use of this fact and uses the calculated "address" as an operand to be loaded into a register.

Dave.

jj2007 · May 02, 2014, 02:04:03 AM

There are differences in speed, but you'll have to test them for every CPU. Example:

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
260 cycles for 100 * eax*5, lea
281 cycles for 100 * eax*5, imul
285 cycles for 100 * eax*7, lea
277 cycles for 100 * eax*7, imul
409 cycles for 100 * eax*7, mul

260 cycles for 100 * eax*5, lea
281 cycles for 100 * eax*5, imul
285 cycles for 100 * eax*7, lea
281 cycles for 100 * eax*7, imul
409 cycles for 100 * eax*7, mul

261 cycles for 100 * eax*5, lea
281 cycles for 100 * eax*5, imul
285 cycles for 100 * eax*7, lea
281 cycles for 100 * eax*7, imul
409 cycles for 100 * eax*7, mul

5 bytes for eax*5, lea
5 bytes for eax*5, imul
8 bytes for eax*7, lea
5 bytes for eax*7, imul
9 bytes for eax*7, mul

FORTRANS · May 02, 2014, 03:46:40 AM

P-MMX W98

pre-P4
424   cycles for 100 * eax*5, lea
1137   cycles for 100 * eax*5, imul
625   cycles for 100 * eax*7, lea
1136   cycles for 100 * eax*7, imul
1139   cycles for 100 * eax*7, mul

424   cycles for 100 * eax*5, lea
1138   cycles for 100 * eax*5, imul
627   cycles for 100 * eax*7, lea
1138   cycles for 100 * eax*7, imul
1141   cycles for 100 * eax*7, mul

429   cycles for 100 * eax*5, lea
1163   cycles for 100 * eax*5, imul
650   cycles for 100 * eax*7, lea
1168   cycles for 100 * eax*7, imul
1157   cycles for 100 * eax*7, mul

5   bytes for eax*5, lea
5   bytes for eax*5, imul
8   bytes for eax*7, lea
5   bytes for eax*7, imul
9   bytes for eax*7, mul

--- ok ---

P-III W2k

pre-P4 (SSE1)

312   cycles for 100 * eax*5, lea
312   cycles for 100 * eax*5, imul
312   cycles for 100 * eax*7, lea
312   cycles for 100 * eax*7, imul
418   cycles for 100 * eax*7, mul

312   cycles for 100 * eax*5, lea
312   cycles for 100 * eax*5, imul
312   cycles for 100 * eax*7, lea
311   cycles for 100 * eax*7, imul
420   cycles for 100 * eax*7, mul

312   cycles for 100 * eax*5, lea
312   cycles for 100 * eax*5, imul
312   cycles for 100 * eax*7, lea
312   cycles for 100 * eax*7, imul
419   cycles for 100 * eax*7, mul

5   bytes for eax*5, lea
5   bytes for eax*5, imul
8   bytes for eax*7, lea
5   bytes for eax*7, imul
9   bytes for eax*7, mul

--- ok ---

The MASM Forum

News:

Mul instruction

gelatine1

jj2007

gelatine1

Gunther

FORTRANS

dedndave

gelatine1

dedndave

dedndave

FORTRANS

gelatine1

jj2007

KeepingRealBusy

jj2007

FORTRANS