Mul instruction

Started by gelatine1, May 01, 2014, 04:36:56 PM

Gunther

Jochen,

the results:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

182     cycles for 100 * eax*5, lea
194     cycles for 100 * eax*5, imul
216     cycles for 100 * eax*7, lea
452     cycles for 100 * eax*7, imul
607     cycles for 100 * eax*7, mul

184     cycles for 100 * eax*5, lea
179     cycles for 100 * eax*5, imul
215     cycles for 100 * eax*7, lea
194     cycles for 100 * eax*7, imul
253     cycles for 100 * eax*7, mul

185     cycles for 100 * eax*5, lea
188     cycles for 100 * eax*5, imul
222     cycles for 100 * eax*7, lea
458     cycles for 100 * eax*7, imul
612     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

--- ok ---


Gunther
You have to know the facts before you can distort them.

gelatine1


Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)

163     cycles for 100 * eax*5, lea
184     cycles for 100 * eax*5, imul
203     cycles for 100 * eax*7, lea
221     cycles for 100 * eax*7, imul
234     cycles for 100 * eax*7, mul

163     cycles for 100 * eax*5, lea
189     cycles for 100 * eax*5, imul
204     cycles for 100 * eax*7, lea
180     cycles for 100 * eax*7, imul
238     cycles for 100 * eax*7, mul

162     cycles for 100 * eax*5, lea
185     cycles for 100 * eax*5, imul
231     cycles for 100 * eax*7, lea
185     cycles for 100 * eax*7, imul
232     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul


Why exactly is mul so much slower than imul? I mean, imul has to do more than mul, right? Imul also has to handle the sign. Or am I wrong?

dedndave

those timings are for 100 passes
seeing as an instruction can only consume whole-number cycles......

you're looking at roughly 2 cycles for everyone - lol
they cache a little differently, so have different timings

btw - your i5 is a screamer
the numbers look different on older CPUs
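
The division behind that "roughly 2 cycles" estimate, as a small Python sketch. The totals are the first run of the i5-2430M results posted above; the helper name cycles_per_pass is made up for illustration. Each pass presumably also carries some loop overhead, so the quotient is an upper bound on the cost of the multiply itself, not a pure instruction latency.

```python
# First run of the i5-2430M results: total cycles reported for 100 passes.
I5_FIRST_RUN = {
    "eax*5, lea": 163,
    "eax*5, imul": 184,
    "eax*7, lea": 203,
    "eax*7, imul": 221,
    "eax*7, mul": 234,
}
PASSES = 100  # "those timings are for 100 passes"


def cycles_per_pass(total_cycles, passes=PASSES):
    """Average cycles for a single pass of the timed code."""
    return total_cycles / passes


if __name__ == "__main__":
    for name, total in I5_FIRST_RUN.items():
        print(f"{name:<12} {cycles_per_pass(total):.2f} cycles per pass")
```

Every variant lands between about 1.6 and 2.4 cycles per pass, so the differences being discussed are fractions of a cycle.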

dedndave

here is my prescott - one of the pentium 4 variations...

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

334     cycles for 100 * eax*5, lea
450     cycles for 100 * eax*5, imul
364     cycles for 100 * eax*7, lea
449     cycles for 100 * eax*7, imul
550     cycles for 100 * eax*7, mul

292     cycles for 100 * eax*5, lea
448     cycles for 100 * eax*5, imul
388     cycles for 100 * eax*7, lea
450     cycles for 100 * eax*7, imul
551     cycles for 100 * eax*7, mul

293     cycles for 100 * eax*5, lea
448     cycles for 100 * eax*5, imul
364     cycles for 100 * eax*7, lea
478     cycles for 100 * eax*7, imul
550     cycles for 100 * eax*7, mul

jj2007

Quote from: gelatine1 on May 02, 2014, 04:13:11 AM
Why exactly is mul so much slower than imul? I mean, imul has to do more than mul, right?

On the contrary, mul also has to fill the edx register.

KeepingRealBusy

AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)

498     cycles for 100 * eax*5, lea
498     cycles for 100 * eax*5, imul
538     cycles for 100 * eax*7, lea
498     cycles for 100 * eax*7, imul
660     cycles for 100 * eax*7, mul

452     cycles for 100 * eax*5, lea
453     cycles for 100 * eax*5, imul
489     cycles for 100 * eax*7, lea
453     cycles for 100 * eax*7, imul
598     cycles for 100 * eax*7, mul

453     cycles for 100 * eax*5, lea
453     cycles for 100 * eax*5, imul
489     cycles for 100 * eax*7, lea
453     cycles for 100 * eax*7, imul
599     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul


--- ok ---

Dave.

FORTRANS

Quote from: jj2007 on May 02, 2014, 04:32:54 AM
Quote from: gelatine1 on May 02, 2014, 04:13:11 AM
Why exactly is mul so much slower than imul? I mean, imul has to do more than mul, right?

On the contrary, mul also has to fill the edx register.

Hi,

   Both MUL and IMUL fill the EDX register when multiplying EAX.
And they should take about the same time to execute.  Jochen
is comparing somewhat different code to try to approximate
real code.


  .Repeat
        mov eax, 1000
        imul eax, eax, 7
        dec ebx
  .Until Sign?

  .Repeat
        mov eax, 1000
        mov edx, 7
        mul edx
        dec ebx
  .Until Sign?


   If the "mov edx, 7" was moved outside the loop the timings
should be closer.  That is rather less likely in real code, though.

   Another possibility would be to define a variable in memory
that holds a 7, and use that: "MUL [Seven]".  That is still another
memory access, but possibly a bit closer to his IMUL example.  One
chooses an example and tests that one.

Regards,

Steve N.

dedndave

Quote from: FORTRANS on May 03, 2014, 12:51:55 AM
Both MUL and IMUL fill the EDX register when multiplying EAX.

really ?   :redface:

jj2007

Quote from: FORTRANS on May 03, 2014, 12:51:55 AM
   If the "mov edx, 7" was moved outside the loop the timings
should be closer.

With a mov ecx, 7 before the loop, timings drop from 411 to 357, but it's still by far the slowest way to multiply.

@dave: not really ;-)
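
A possible reason the hoisted constant went into ecx rather than staying in edx: one-operand mul writes the high half of the product to edx, so a 7 parked there is overwritten on the first pass. The Python sketch below models only that register behaviour. The names mul32 and run_loop are hypothetical, and the loop shape is assumed from the fragments quoted earlier in the thread, not taken from the actual test source.

```python
MASK32 = 0xFFFFFFFF


def mul32(eax, src):
    """Model of one-operand MUL: unsigned EAX * src, returned as (EDX, EAX)."""
    product = (eax & MASK32) * (src & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def run_loop(passes, reload_each_pass):
    """Multiply 1000 by whatever EDX holds, `passes` times; return each EAX."""
    results = []
    edx = 7                          # mov edx, 7 before the loop
    for _ in range(passes):
        if reload_each_pass:
            edx = 7                  # mov edx, 7 inside the loop
        eax = 1000                   # mov eax, 1000
        edx, eax = mul32(eax, edx)   # mul edx: overwrites both EDX and EAX
        results.append(eax)
    return results


if __name__ == "__main__":
    print("reloaded every pass:", run_loop(3, reload_each_pass=True))
    print("loaded once in edx: ", run_loop(3, reload_each_pass=False))
```

In this model the second call goes wrong after the first pass, because edx then holds the high half of 7000, which is 0. A register that mul does not write (such as ecx) or a memory operand avoids the reload.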

FORTRANS

Hi,

   Oops.  It depends on which version of IMUL is used.  The
one argument version does fill it.  The two and three argument
versions do not.  And of course I don't use IMUL often, and
when I did, I used the one argument version.  Habit, I guess.

Regards,

Steve N.

jj2007

Quote from: FORTRANS on May 03, 2014, 03:59:34 AM
It depends on which version of IMUL is used.  The one argument version does fill it.  The two and three argument versions do not.

Little demo:

include \masm32\MasmBasic\MasmBasic.inc      ; download
  Init
  mov edx, 87654321
  mov ecx, 200
  imul eax, ecx, 1000
  imul ecx, 50
  deb 4, "three+two args, edx unchanged", eax, ecx, edx
  mov eax, 123456789
  imul ecx
  push edx      ; we'll display the edx:eax
  push eax      ; pair as a qword via xmm0
  movlps xmm0, QWORD PTR [esp]
  deb 4, "one arg, 123456789*10000", edx, eax, ecx, xmm0
  add esp, 2*DWORD
  Exit
end start


Output:

three+two args, edx unchanged
eax             200000
ecx             10000
edx             87654321

one arg, 123456789*10000
edx             287
eax             1912276048
ecx             10000
xmm0            1234567890000

FORTRANS

Hi Jochen,

   Nice demonstration, I may have to try out MasmBasic on an XP
computer here.  Amusing that all the numbers were positive.

   Looking at IMUL, I wonder if long sequences of multiplies would
be sped up any by using a series of different target registers?

Cheers,

Steve

jj2007

Quote from: FORTRANS on May 05, 2014, 06:27:11 AM
Hi Jochen,

   Nice demonstration, I may have to try out MasmBasic on an XP
computer here.  Amusing that all the numbers were positive.

Thanks ;-)

With mov ecx, -200:

three+two args, edx unchanged
eax             -200000
ecx             -10000
edx             87654321

one arg, 123456789*10000
edx             -288
eax             -1912276048
ecx             -10000
xmm0            -1234567890000
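
Both register dumps (the positive run and this negative one) can be checked by hand. The Python sketch below models the two behaviours under discussion: the one-operand form splits the signed 64-bit product across edx:eax, while the two- and three-operand forms keep only the low 32 bits and leave edx alone. All function names are made up for illustration, and the model ignores flags.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a signed dword."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def imul_one_operand(eax, src):
    """imul src: signed EAX * src, 64-bit product returned as (EDX, EAX)."""
    product = to_signed32(eax) * to_signed32(src)
    return to_signed32(product >> 32), to_signed32(product)


def imul_truncating(a, b):
    """imul reg, src or imul reg, src, imm: low 32 bits only, EDX untouched."""
    return to_signed32(to_signed32(a) * to_signed32(b))


def edx_eax_to_int64(edx, eax):
    """Recombine the register pair into the signed 64-bit value it encodes."""
    return (to_signed32(edx) << 32) + (eax & MASK32)


if __name__ == "__main__":
    for ecx in (10000, -10000):
        edx, eax = imul_one_operand(123456789, ecx)
        print(ecx, edx, eax, edx_eax_to_int64(edx, eax))
```

The negative case shows edx as -288 rather than -287 because the low dword is stored in two's complement: -288 * 2^32 + 2382691248 = -1234567890000, and 2382691248 read as a signed dword is -1912276048.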


Quote:
Looking at IMUL, I wonder if long sequences of multiplies would
be sped up any by using a series of different target registers?

You mean alternating regs? Can you give an example?

FORTRANS

Hi,

   Would the first block of code be "noticeably" faster than the
second?


       MOV     EAX,[Mem1]
       MOV     EBX,[Mem2]
       MOV     ECX,[Mem3]
       MOV     EDX,[Mem4]
       MOV     EDI,[Mem5]
       MOV     ESI,[Mem6]
       IMUL    EAX,[Arg1]
       IMUL    EBX,[Arg2]
       IMUL    ECX,[Arg3]
       IMUL    EDX,[Arg4]
       IMUL    EDI,[Arg5]
       IMUL    ESI,[Arg6]
       MOV     [Res1],EAX
       MOV     [Res2],EBX
       MOV     [Res3],ECX
       MOV     [Res4],EDX
       MOV     [Res5],EDI
       MOV     [Res6],ESI

       MOV     EAX,[Mem1]
       IMUL    EAX,[Arg1]
       MOV     [Res1],EAX
       MOV     EAX,[Mem2]
       IMUL    EAX,[Arg2]
       MOV     [Res2],EAX
       MOV     EAX,[Mem3]
       IMUL    EAX,[Arg3]
       MOV     [Res3],EAX
       MOV     EAX,[Mem4]
       IMUL    EAX,[Arg4]
       MOV     [Res4],EAX
       MOV     EAX,[Mem5]
       IMUL    EAX,[Arg5]
       MOV     [Res5],EAX
       MOV     EAX,[Mem6]
       IMUL    EAX,[Arg6]
       MOV     [Res6],EAX


Regards,

Steve N.

jj2007

Quote from: FORTRANS on May 05, 2014, 08:15:20 AM
   Would the first block of code be "noticeably" faster than the
second?

"Noticeably" indeed, Steve: one cycle less. It's ten bytes longer, though, so occasionally the cache might make block B the better solution.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

1520    cycles for 100 * block A
1610    cycles for 100 * block B

1513    cycles for 100 * block A
1611    cycles for 100 * block B

1513    cycles for 100 * block A
1614    cycles for 100 * block B

111     bytes for block A
101     bytes for block B
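
The "one cycle less" figure can be reproduced from the numbers above. This Python sketch assumes that "cycles for 100 * block" means 100 passes over the whole six-multiply block; the helper name saving_per_pass is hypothetical.

```python
# Athlon 4450B results: total cycles for 100 passes, three runs each.
BLOCK_A_CYCLES = [1520, 1513, 1513]   # six registers, loads grouped first
BLOCK_B_CYCLES = [1610, 1611, 1614]   # eax reused for every multiply
BLOCK_A_BYTES = 111
BLOCK_B_BYTES = 101
PASSES = 100
MULTIPLIES_PER_PASS = 6


def saving_per_pass(cycles_a, cycles_b, passes=PASSES):
    """Cycles saved by block A over block B in one pass of the block."""
    return (cycles_b - cycles_a) / passes


if __name__ == "__main__":
    for a, b in zip(BLOCK_A_CYCLES, BLOCK_B_CYCLES):
        saved = saving_per_pass(a, b)
        print(f"{saved:.2f} cycles per pass, "
              f"{saved / MULTIPLIES_PER_PASS:.2f} per multiply")
    print("size cost:", BLOCK_A_BYTES - BLOCK_B_BYTES, "bytes")
```

Under that assumption the gain is about one cycle per pass of roughly fifteen, bought with ten extra bytes of code.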