News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Mul instruction

Started by gelatine1, May 01, 2014, 04:36:56 PM

Previous topic - Next topic

gelatine1

Which complexity does the mul instruction use ?  Is it possible to make something faster than the mul instruction (with a faster algorithm)?

jj2007

Search The Laboratory for imul lea shl. Check also \Masm32\Help\Opcodes.chm for these instructions.

gelatine1

Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?
And what do they mean with EDX:EAX ?

Gunther

Hi gelatine1,

Quote from: gelatine1 on May 01, 2014, 07:29:51 PM
Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?
And what do they mean with EDX:EAX ?

a multiplication of 2 32-bit numbers leads to a 64-bit result. EAX holds bits 0-31 and EDX holds bits 32-63 of the result. There's nothing magic and the Intel or AMD manuals would have tell you that "secret".

Gunther
You have to know the facts before you can distort them.

FORTRANS

Quote from: gelatine1 on May 01, 2014, 07:29:51 PM
Okay, I checked the mul instruction (I wanted unsigned) in the file But it doesn't say anything exactly if this can be done faster or not ?

Hi,

   In the general case, mostly no.  A MUL instruction will be the
fastest way to multiply.  For special cases you can do better
using other logic.  For instance, multiplying by a factor of two
might be faster using SHL 1, shift left by one position.  jj2007
mentioned the LEA instruction, that can be used for a fast
multiply by 5 in some cases.  However you have to be careful
that the argument being multiplied is not too large.

   Until you really need to speed some code up, I would suggest
you use MUL or IMUL and just work on getting the right answer.
Unless you just want to play around getting things done in a
different way of course.  That can be fun in itself at times.

Cheers,

Steve N.

dedndave

MUL is fairly fast (8 to 10 cycles on my machine)
it's DIV that is slow   :P

gelatine1

Thanks for the replies, I have a few more questions:
@FORTRANS: How exactly would I use lea to multiply by 5 ? I am just wondering because I wouldn't actually now how it would be done.
@dedndave: Do you mean that I could write some code that runs faster than the div instruction ? Or do you mean that a division is slow in general ?

dedndave

both are true, to some degree
DIV is slow - varies from machine to machine
but, on mine it runs 60 to 80 cycles or so

many guys have figured out ways to multiply by the inverse of a number
this may be used as a fast division in cases where the divisor is a constant
qWord has written a nice tool for that called Magic Number

LEA.....
    lea     eax,[4*edx+edx]    ;EAX = 5*EDX, provided it doesn't overflow 32 bits

works well on most modern processors
although, not wonderful on the pentium 4's - MUL may be as good, IMUL slightly better


FORTRANS

Quote from: gelatine1 on May 02, 2014, 01:02:04 AM
How exactly would I use lea to multiply by 5 ? I am just wondering because I wouldn't actually now how it would be done.

Hi,

   From the Assembly Gems site, in their "LEA a power command"
page, I got the following:

        lea     ebx,[ebx+ebx*4] ; ebx:=ebx*5

   Lots of confusing stuff there.

http://www.df.lth.se/~john_e/fr_gems.html

Cheers,

Steve N.

gelatine1

Thanks a lot for the replies once again, I'm still wondering why exactly this 'lea trick' runs faster. It seems very clever but why would it only be faster for a multiplication with 5 ? Why wouldnt it be faster for multiplication with any number ?
Ill take a look to the magic number later when i have finished my multiplication (I'm writing a 64bit multiplication)
And thank you for the gem page. I only looked for a minute and i read a lot of interesting stuff already.

jj2007

It can't be faster for any number because it doesn't work with "any" number.

lea ebx,[ebx+ebx*2] ; ebx:=ebx*3

lea ebx,[ebx+ebx*4] ; ebx:=ebx*5

lea ebx,[ebx+ebx*8] ; ebx:=ebx*9

lea ebx,[ebx+ebx*4] ; ebx:=ebx*10
add ebx, ebx

lea edx,[ebx+ebx*2] ; ebx:=ebx*7
lea ebx,[edx+ebx*4]

KeepingRealBusy

Quote from: gelatine1 on May 02, 2014, 01:29:48 AM
Thanks a lot for the replies once again, I'm still wondering why exactly this 'lea trick' runs faster. It seems very clever but why would it only be faster for a multiplication with 5 ? Why wouldnt it be faster for multiplication with any number ?
Ill take a look to the magic number later when i have finished my multiplication (I'm writing a 64bit multiplication)
And thank you for the gem page. I only looked for a minute and i read a lot of interesting stuff already.

The reason it runs faster (LEA ebx,[ebx+ebx*4]) is that the computation of x*4 and x*1 is done in the pipeline before it ever gets to the operation (arithmetic) unit, and the result is then just placed into the register by the OU. A multiplication is not actually done, the x is shifted by 2 bits and just added to x. Note LEA means Load Effective Address and the pipeline is used to calculate addresses for memory accesses, but the LEA instruction makes use of this fact and uses the calculated "address" as an operand to be loaded into a register.

Dave.

jj2007

There are differences in speed, but you'll have to test them for every CPU. Example:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
260     cycles for 100 * eax*5, lea
281     cycles for 100 * eax*5, imul
285     cycles for 100 * eax*7, lea
277     cycles for 100 * eax*7, imul
409     cycles for 100 * eax*7, mul

260     cycles for 100 * eax*5, lea
281     cycles for 100 * eax*5, imul
285     cycles for 100 * eax*7, lea
281     cycles for 100 * eax*7, imul
409     cycles for 100 * eax*7, mul

261     cycles for 100 * eax*5, lea
281     cycles for 100 * eax*5, imul
285     cycles for 100 * eax*7, lea
281     cycles for 100 * eax*7, imul
409     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

FORTRANS

P-MMX W98

pre-P4
424   cycles for 100 * eax*5, lea
1137   cycles for 100 * eax*5, imul
625   cycles for 100 * eax*7, lea
1136   cycles for 100 * eax*7, imul
1139   cycles for 100 * eax*7, mul

424   cycles for 100 * eax*5, lea
1138   cycles for 100 * eax*5, imul
627   cycles for 100 * eax*7, lea
1138   cycles for 100 * eax*7, imul
1141   cycles for 100 * eax*7, mul

429   cycles for 100 * eax*5, lea
1163   cycles for 100 * eax*5, imul
650   cycles for 100 * eax*7, lea
1168   cycles for 100 * eax*7, imul
1157   cycles for 100 * eax*7, mul

5   bytes for eax*5, lea
5   bytes for eax*5, imul
8   bytes for eax*7, lea
5   bytes for eax*7, imul
9   bytes for eax*7, mul


--- ok ---

P-III W2k

pre-P4 (SSE1)

312   cycles for 100 * eax*5, lea
312   cycles for 100 * eax*5, imul
312   cycles for 100 * eax*7, lea
312   cycles for 100 * eax*7, imul
418   cycles for 100 * eax*7, mul

312   cycles for 100 * eax*5, lea
312   cycles for 100 * eax*5, imul
312   cycles for 100 * eax*7, lea
311   cycles for 100 * eax*7, imul
420   cycles for 100 * eax*7, mul

312   cycles for 100 * eax*5, lea
312   cycles for 100 * eax*5, imul
312   cycles for 100 * eax*7, lea
312   cycles for 100 * eax*7, imul
419   cycles for 100 * eax*7, mul

5   bytes for eax*5, lea
5   bytes for eax*5, imul
8   bytes for eax*7, lea
5   bytes for eax*7, imul
9   bytes for eax*7, mul


--- ok ---