Author Topic: Mul instruction  (Read 8639 times)

Gunther

Re: Mul instruction
« Reply #15 on: May 02, 2014, 03:47:33 AM »
Jochen,

the results:
Code:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

182     cycles for 100 * eax*5, lea
194     cycles for 100 * eax*5, imul
216     cycles for 100 * eax*7, lea
452     cycles for 100 * eax*7, imul
607     cycles for 100 * eax*7, mul

184     cycles for 100 * eax*5, lea
179     cycles for 100 * eax*5, imul
215     cycles for 100 * eax*7, lea
194     cycles for 100 * eax*7, imul
253     cycles for 100 * eax*7, mul

185     cycles for 100 * eax*5, lea
188     cycles for 100 * eax*5, imul
222     cycles for 100 * eax*7, lea
458     cycles for 100 * eax*7, imul
612     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

--- ok ---
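
The test source behind the lea rows is not shown in this thread, so the following is only a sketch of one common decomposition, written in C so it is easy to check: eax*5 as eax + eax*4 (what a single `lea eax, [eax+eax*4]` computes) and eax*7 as eax*8 - eax (a scaled lea followed by a sub). The function names are made up for illustration and are not taken from the benchmark.

```c
#include <stdint.h>

/* eax*5 as base + index*4, the shape of `lea eax, [eax+eax*4]`. */
uint32_t times5_lea(uint32_t x) { return x + (x << 2); }

/* eax*7 as eax*8 - eax: a scaled lea followed by a sub (assumed form). */
uint32_t times7_lea(uint32_t x) { return (x << 3) - x; }

/* Reference results: plain 32-bit multiply, like `imul eax, eax, 7`. */
uint32_t times5_imul(uint32_t x) { return x * 5u; }
uint32_t times7_imul(uint32_t x) { return x * 7u; }
```

Both decompositions agree with the plain 32-bit multiply for every input, including values that wrap around, because all of the arithmetic is modulo 2^32.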

Gunther
Get your facts first, and then you can distort them.

gelatine1

Re: Mul instruction
« Reply #16 on: May 02, 2014, 04:13:11 AM »
Code:
Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)

163     cycles for 100 * eax*5, lea
184     cycles for 100 * eax*5, imul
203     cycles for 100 * eax*7, lea
221     cycles for 100 * eax*7, imul
234     cycles for 100 * eax*7, mul

163     cycles for 100 * eax*5, lea
189     cycles for 100 * eax*5, imul
204     cycles for 100 * eax*7, lea
180     cycles for 100 * eax*7, imul
238     cycles for 100 * eax*7, mul

162     cycles for 100 * eax*5, lea
185     cycles for 100 * eax*5, imul
231     cycles for 100 * eax*7, lea
185     cycles for 100 * eax*7, imul
232     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

Why exactly is mul so much slower than imul? I mean, imul has to do more than mul, right? imul also has to handle the sign. Or am I wrong?

dedndave

Re: Mul instruction
« Reply #17 on: May 02, 2014, 04:22:12 AM »
those timings are for 100 passes
seeing as an instruction can only consume whole-number cycles...

you're looking at roughly 2 cycles per pass for every one of them - lol
they cache a little differently, so the timings differ

btw - your i5 is a screamer
the numbers look different on older CPUs
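
To put the posted totals in per-pass terms, here is a minimal C sketch of the arithmetic. The helper name is hypothetical, and the figure it returns is an average over the whole timed loop, so it presumably includes any loop overhead as well as the multiply itself.

```c
/* Average cycles per loop pass, given a total measured over `passes` passes. */
double cycles_per_pass(unsigned total_cycles, unsigned passes)
{
    return (double)total_cycles / (double)passes;
}
```

For the i5 numbers above, `cycles_per_pass(163, 100)` gives 1.63 for the lea version of eax*5 and `cycles_per_pass(238, 100)` gives 2.38 for the slowest mul run. The i7 outliers work out to `cycles_per_pass(607, 100)`, about 6 cycles per pass.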

dedndave

Re: Mul instruction
« Reply #18 on: May 02, 2014, 04:26:32 AM »
here is my prescott - one of the pentium 4 variations...

Code:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

334     cycles for 100 * eax*5, lea
450     cycles for 100 * eax*5, imul
364     cycles for 100 * eax*7, lea
449     cycles for 100 * eax*7, imul
550     cycles for 100 * eax*7, mul

292     cycles for 100 * eax*5, lea
448     cycles for 100 * eax*5, imul
388     cycles for 100 * eax*7, lea
450     cycles for 100 * eax*7, imul
551     cycles for 100 * eax*7, mul

293     cycles for 100 * eax*5, lea
448     cycles for 100 * eax*5, imul
364     cycles for 100 * eax*7, lea
478     cycles for 100 * eax*7, imul
550     cycles for 100 * eax*7, mul

jj2007

Re: Mul instruction
« Reply #19 on: May 02, 2014, 04:32:54 AM »
Quote
Why exactly is mul so much slower than imul? I mean, imul has to do more than mul, right?

On the contrary, mul also has to fill the edx register.

KeepingRealBusy

Re: Mul instruction
« Reply #20 on: May 02, 2014, 04:42:54 AM »
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)

498     cycles for 100 * eax*5, lea
498     cycles for 100 * eax*5, imul
538     cycles for 100 * eax*7, lea
498     cycles for 100 * eax*7, imul
660     cycles for 100 * eax*7, mul

452     cycles for 100 * eax*5, lea
453     cycles for 100 * eax*5, imul
489     cycles for 100 * eax*7, lea
453     cycles for 100 * eax*7, imul
598     cycles for 100 * eax*7, mul

453     cycles for 100 * eax*5, lea
453     cycles for 100 * eax*5, imul
489     cycles for 100 * eax*7, lea
453     cycles for 100 * eax*7, imul
599     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul


--- ok ---

Dave.

FORTRANS

Re: Mul instruction
« Reply #21 on: May 03, 2014, 12:51:55 AM »
Quote
Why exactly is mul so much slower than imul? I mean, imul has to do more than mul, right?

Quote
On the contrary, mul also has to fill the edx register.

Hi,

   Both MUL and IMUL fill the EDX register when multiplying EAX.
And they should take about the same time to execute.  Jochen
is comparing somewhat different code to try to approximate
real code.

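Code:
  .Repeat
        mov eax, 1000
        imul eax, eax, 7        ; three-operand form: eax = eax*7, edx untouched
        dec ebx
  .Until Sign?                  ; loop until ebx goes negative

  .Repeat
        mov eax, 1000
        mov edx, 7              ; extra instruction: mul takes no immediate operand
        mul edx                 ; one-operand form: edx:eax = eax*edx
        dec ebx
  .Until Sign?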

   If the "mov edx, 7" were moved outside the loop, the timings
should be closer.  But that is rather less likely in real code.

   Another possibility would be to define a variable in memory
holding a 7, and use "MUL [Seven]".  That is still another memory
access, but possibly a bit closer to his IMUL example.  One chooses
an example and tests that one.

Regards,

Steve N.

dedndave

Re: Mul instruction
« Reply #22 on: May 03, 2014, 02:45:24 AM »
Quote
Both MUL and IMUL fill the EDX register when multiplying EAX.

really ?   :redface:

jj2007

Re: Mul instruction
« Reply #23 on: May 03, 2014, 03:32:51 AM »
Quote
   If the "mov edx, 7" were moved outside the loop, the timings
should be closer.

With a mov ecx, 7 before the loop (ecx, because mul overwrites edx with the high half of the product), timings drop from 411 to 357, but it's still by far the slowest way to multiply.

@dave: not really ;-)

FORTRANS

Re: Mul instruction
« Reply #24 on: May 03, 2014, 03:59:34 AM »
Hi,

   Oops.  It depends on which version of IMUL is used.  The
one argument version does fill it.  The two and three argument
versions do not.  And of course I don't use IMUL often, and
when I did, I used the one argument version.  Habit, I guess.

Regards,

Steve N.

jj2007

Re: Mul instruction
« Reply #25 on: May 03, 2014, 07:33:27 AM »
Quote
It depends on which version of IMUL is used.  The one argument version does fill it.  The two and three argument versions do not.

Little demo:

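include \masm32\MasmBasic\MasmBasic.inc      ; MasmBasic library (separate download)
  Init
  mov edx, 87654321             ; marker value: shows whether edx gets overwritten
  mov ecx, 200
  imul eax, ecx, 1000           ; three-operand form: eax = ecx*1000
  imul ecx, 50                  ; two-operand form: ecx = ecx*50
  deb 4, "three+two args, edx unchanged", eax, ecx, edx
  mov eax, 123456789
  imul ecx                      ; one-operand form: edx:eax = eax*ecx
  push edx      ; we'll display the edx:eax
  push eax      ; pair as a qword via xmm0
  movlps xmm0, QWORD PTR [esp]
  deb 4, "one arg, 123456789*10000", edx, eax, ecx, xmm0
  add esp, 2*DWORD              ; drop the two pushed dwords
  Exit
end start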


Output:

three+two args, edx unchanged
eax             200000
ecx             10000
edx             87654321

one arg, 123456789*10000
edx             287
eax             1912276048
ecx             10000
xmm0            1234567890000
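
The edx and eax values in that output are just the high and low halves of the 64-bit product. A minimal C sketch of the same split follows; the struct and function names are invented for illustration, and the narrowing casts assume the usual two's-complement behaviour of mainstream compilers.

```c
#include <stdint.h>

/* Result of one-operand IMUL: the 64-bit signed product split as edx:eax. */
typedef struct {
    int32_t  edx;   /* high 32 bits, carries the sign */
    uint32_t eax;   /* low 32 bits */
} imul_wide;

imul_wide imul_one_operand(int32_t eax, int32_t src)
{
    /* Widen first so the full 64-bit product is kept. */
    uint64_t bits = (uint64_t)((int64_t)eax * (int64_t)src);
    imul_wide r;
    r.edx = (int32_t)(uint32_t)(bits >> 32);
    r.eax = (uint32_t)bits;
    return r;
}

/* Two- and three-operand IMUL keep only the low 32 bits of the product. */
int32_t imul_truncating(int32_t a, int32_t b)
{
    return (int32_t)(uint32_t)((int64_t)a * (int64_t)b);
}
```

With eax = 123456789 and ecx = 10000, `imul_one_operand` returns edx = 287 and eax = 1912276048, and 287 * 2^32 + 1912276048 = 1234567890000, which is the qword shown for xmm0.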

FORTRANS

Re: Mul instruction
« Reply #26 on: May 05, 2014, 06:27:11 AM »
Hi Jochen,

   Nice demonstration, I may have to try out MasmBasic on an XP
computer here.  Amusing that all the numbers were positive.

   Looking at IMUL, I wonder if long sequences of multiplies would
be sped up any by using a series of different target registers?

Cheers,

Steve

jj2007

Re: Mul instruction
« Reply #27 on: May 05, 2014, 07:41:09 AM »
Quote
Hi Jochen,

   Nice demonstration, I may have to try out MasmBasic on an XP
computer here.  Amusing that all the numbers were positive.

Thanks ;-)

With mov ecx, -200:

three+two args, edx unchanged
eax             -200000
ecx             -10000
edx             87654321

one arg, 123456789*10000
edx             -288
eax             -1912276048
ecx             -10000
xmm0            -1234567890000
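
These negative results also bear on the earlier question about sign handling. As a rough C sketch (function names are hypothetical, and the casts assume two's-complement narrowing), the low half of the product is the same bit pattern whether the operands are treated as signed or unsigned; only the high half in edx differs between one-operand mul and imul.

```c
#include <stdint.h>

/* High half (edx) after one-operand MUL: operands treated as unsigned. */
uint32_t mul_high(uint32_t eax, uint32_t src)
{
    return (uint32_t)(((uint64_t)eax * (uint64_t)src) >> 32);
}

/* High half (edx) after one-operand IMUL: operands treated as signed. */
int32_t imul_high(int32_t eax, int32_t src)
{
    uint64_t bits = (uint64_t)((int64_t)eax * (int64_t)src);
    return (int32_t)(uint32_t)(bits >> 32);
}

/* Low half (eax): identical bits for MUL and IMUL. */
uint32_t mul_low(uint32_t eax, uint32_t src)
{
    return eax * src;
}
```

For eax = 123456789 and ecx = -10000, `imul_high` gives -288 as in the output above, while `mul_high` on the same bit patterns gives 123456501. `mul_low` gives 2382691248 in both cases, which reads as -1912276048 when interpreted as a signed dword.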


Quote
  Looking at IMUL, I wonder if long sequences of multiplies would
be sped up any by using a series of different target registers?

You mean alternating regs? Can you give an example?

FORTRANS

Re: Mul instruction
« Reply #28 on: May 05, 2014, 08:15:20 AM »
Hi,

   Would the first block of code be "noticeably" faster than the
second?

Code:
       MOV     EAX,[Mem1]
       MOV     EBX,[Mem2]
       MOV     ECX,[Mem3]
       MOV     EDX,[Mem4]
       MOV     EDI,[Mem5]
       MOV     ESI,[Mem6]
       IMUL    EAX,[Arg1]
       IMUL    EBX,[Arg2]
       IMUL    ECX,[Arg3]
       IMUL    EDX,[Arg4]
       IMUL    EDI,[Arg5]
       IMUL    ESI,[Arg6]
       MOV     [Res1],EAX
       MOV     [Res2],EBX
       MOV     [Res3],ECX
       MOV     [Res4],EDX
       MOV     [Res5],EDI
       MOV     [Res6],ESI

       MOV     EAX,[Mem1]
       IMUL    EAX,[Arg1]
       MOV     [Res1],EAX
       MOV     EAX,[Mem2]
       IMUL    EAX,[Arg2]
       MOV     [Res2],EAX
       MOV     EAX,[Mem3]
       IMUL    EAX,[Arg3]
       MOV     [Res3],EAX
       MOV     EAX,[Mem4]
       IMUL    EAX,[Arg4]
       MOV     [Res4],EAX
       MOV     EAX,[Mem5]
       IMUL    EAX,[Arg5]
       MOV     [Res5],EAX
       MOV     EAX,[Mem6]
       IMUL    EAX,[Arg6]
       MOV     [Res6],EAX

Regards,

Steve N.

jj2007

Re: Mul instruction
« Reply #29 on: May 05, 2014, 04:51:30 PM »
Quote
   Would the first block of code be "noticeably" faster than the
second?

"Noticeably" indeed, Steve: about one cycle less per pass (roughly 1520 vs. 1610 cycles for 100 passes). Block A is ten bytes longer, though, so occasionally the cache might make block B the better solution.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

1520    cycles for 100 * block A
1610    cycles for 100 * block B

1513    cycles for 100 * block A
1611    cycles for 100 * block B

1513    cycles for 100 * block A
1614    cycles for 100 * block B

111     bytes for block A
101     bytes for block B