Mul instruction

Gunther · May 02, 2014, 03:47:33 AM

Jochen,

the results:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

182     cycles for 100 * eax*5, lea
194     cycles for 100 * eax*5, imul
216     cycles for 100 * eax*7, lea
452     cycles for 100 * eax*7, imul
607     cycles for 100 * eax*7, mul

184     cycles for 100 * eax*5, lea
179     cycles for 100 * eax*5, imul
215     cycles for 100 * eax*7, lea
194     cycles for 100 * eax*7, imul
253     cycles for 100 * eax*7, mul

185     cycles for 100 * eax*5, lea
188     cycles for 100 * eax*5, imul
222     cycles for 100 * eax*7, lea
458     cycles for 100 * eax*7, imul
612     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

--- ok ---

Gunther

gelatine1 · May 02, 2014, 04:13:11 AM

Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)

163     cycles for 100 * eax*5, lea
184     cycles for 100 * eax*5, imul
203     cycles for 100 * eax*7, lea
221     cycles for 100 * eax*7, imul
234     cycles for 100 * eax*7, mul

163     cycles for 100 * eax*5, lea
189     cycles for 100 * eax*5, imul
204     cycles for 100 * eax*7, lea
180     cycles for 100 * eax*7, imul
238     cycles for 100 * eax*7, mul

162     cycles for 100 * eax*5, lea
185     cycles for 100 * eax*5, imul
231     cycles for 100 * eax*7, lea
185     cycles for 100 * eax*7, imul
232     cycles for 100 * eax*7, mul

5       bytes for eax*5, lea
5       bytes for eax*5, imul
8       bytes for eax*7, lea
5       bytes for eax*7, imul
9       bytes for eax*7, mul

Why exactly is mul so much slower than imul? I mean, imul has to do more than mul, right? imul also has to handle the sign. Or am I wrong?

dedndave · May 02, 2014, 04:22:12 AM

those timings are for 100 passes
seeing as an instruction can only consume whole-number cycles......

you're looking at roughly 2 cycles for everyone - lol
they cache a little differently, so have different timings

btw - your i5 is a screamer
the numbers look different on older CPUs

dedndave · May 02, 2014, 04:26:32 AM

here is my prescott - one of the pentium 4 variations...

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

334     cycles for 100 * eax*5, lea
450     cycles for 100 * eax*5, imul
364     cycles for 100 * eax*7, lea
449     cycles for 100 * eax*7, imul
550     cycles for 100 * eax*7, mul

292     cycles for 100 * eax*5, lea
448     cycles for 100 * eax*5, imul
388     cycles for 100 * eax*7, lea
450     cycles for 100 * eax*7, imul
551     cycles for 100 * eax*7, mul

293     cycles for 100 * eax*5, lea
448     cycles for 100 * eax*5, imul
364     cycles for 100 * eax*7, lea
478     cycles for 100 * eax*7, imul
550     cycles for 100 * eax*7, mul

jj2007 · May 02, 2014, 04:32:54 AM

Quote from: gelatine1 on May 02, 2014, 04:13:11 AM
Why exactly is mul so much slower than imul? I mean, imul has to do more than mul, right?

On the contrary, mul also has to fill the edx register.

KeepingRealBusy · May 02, 2014, 04:42:54 AM

AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)

498 cycles for 100 * eax*5, lea
498 cycles for 100 * eax*5, imul
538 cycles for 100 * eax*7, lea
498 cycles for 100 * eax*7, imul
660 cycles for 100 * eax*7, mul

452 cycles for 100 * eax*5, lea
453 cycles for 100 * eax*5, imul
489 cycles for 100 * eax*7, lea
453 cycles for 100 * eax*7, imul
598 cycles for 100 * eax*7, mul

453 cycles for 100 * eax*5, lea
453 cycles for 100 * eax*5, imul
489 cycles for 100 * eax*7, lea
453 cycles for 100 * eax*7, imul
599 cycles for 100 * eax*7, mul

5 bytes for eax*5, lea
5 bytes for eax*5, imul
8 bytes for eax*7, lea
5 bytes for eax*7, imul
9 bytes for eax*7, mul

--- ok ---

Dave.

FORTRANS · May 03, 2014, 12:51:55 AM

Quote from: jj2007 on May 02, 2014, 04:32:54 AM
Quote from: gelatine1 on May 02, 2014, 04:13:11 AM
Why exactly is mul so much slower than imul? I mean, imul has to do more than mul, right?

On the contrary, mul also has to fill the edx register.

Hi,

Both MUL and IMUL fill the EDX register when multiplying EAX.
And they should take about the same time to execute. Jochen
is comparing somewhat different code to try to approximate
real code.

  .Repeat
	mov eax, 1000
	imul eax, eax, 7
	dec ebx
  .Until Sign?

  .Repeat
	mov eax, 1000
	mov edx, 7
	mul edx
	dec ebx
  .Until Sign?

If the "mov edx, 7" was moved outside the loop the timings
should be closer. But rather less probable in real code.

Another possibility would be defining a variable in memory to
hold a 7, and use that "MUL [Seven]". Still another memory
access, but possibly a bit closer to his IMUL example. One chooses
an example and tests that one.

Regards,

Steve N.

dedndave · May 03, 2014, 02:45:24 AM

Quote from: FORTRANS on May 03, 2014, 12:51:55 AM
Both MUL and IMUL fill the EDX register when multiplying EAX.

really ? :redface:

jj2007 · May 03, 2014, 03:32:51 AM

Quote from: FORTRANS on May 03, 2014, 12:51:55 AM
If the "mov edx, 7" was moved outside the loop the timings
should be closer.

With a mov ecx, 7 before the loop, timings drop from 411 to 357, but it's still by far the slowest way to multiply.

@dave: not really ;-)

FORTRANS · May 03, 2014, 03:59:34 AM

Hi,

Oops. It depends on which version of IMUL is used. The
one argument version does fill it. The two and three argument
versions do not. And of course I don't use IMUL often, and
when I did, I used the one argument version. Habit, I guess.

Regards,

Steve N.

jj2007 · May 03, 2014, 07:33:27 AM

Quote from: FORTRANS on May 03, 2014, 03:59:34 AM
It depends on which version of IMUL is used. The one argument version does fill it. The two and three argument versions do not.

Little demo:

include \masm32\MasmBasic\MasmBasic.inc ; download
Init
mov edx, 87654321
mov ecx, 200
imul eax, ecx, 1000
imul ecx, 50
deb 4, "three+two args, edx unchanged", eax, ecx, edx
mov eax, 123456789
imul ecx
push edx ; we'll display the edx:eax
push eax ; pair as a qword via xmm0
movlps xmm0, QWORD PTR [esp]
deb 4, "one arg, 123456789*10000", edx, eax, ecx, xmm0
add esp, 2*DWORD
Exit
end start

Output:

three+two args, edx unchanged
eax 200000
ecx 10000
edx 87654321

one arg, 123456789*10000
edx 287
eax 1912276048
ecx 10000
xmm0 1234567890000

FORTRANS · May 05, 2014, 06:27:11 AM

Hi Jochen,

Nice demonstration, I may have to try out MasmBasic on an XP
computer here. Amusing that all the numbers were positive.

Looking at IMUL, I wonder if long sequences of multiplies would
be sped up any by using a series of different target registers?

Cheers,

Steve

jj2007 · May 05, 2014, 07:41:09 AM

Quote from: FORTRANS on May 05, 2014, 06:27:11 AM
Hi Jochen,

Nice demonstration, I may have to try out MasmBasic on an XP
computer here. Amusing that all the numbers were positive.

Thanks ;-)

With mov ecx, -200:

three+two args, edx unchanged
eax -200000
ecx -10000
edx 87654321

one arg, 123456789*10000
edx -288
eax -1912276048
ecx -10000
xmm0 -1234567890000

Quote
Looking at IMUL, I wonder if long sequences of multiplies would
be sped up any by using a series of different target registers?

You mean alternating regs? Can you give an example?

FORTRANS · May 05, 2014, 08:15:20 AM

Hi,

Would the first block of code be "noticeably" faster than the
second?

       MOV     EAX,[Mem1]
       MOV     EBX,[Mem2]
       MOV     ECX,[Mem3]
       MOV     EDX,[Mem4]
       MOV     EDI,[Mem5]
       MOV     ESI,[Mem6]
       IMUL    EAX,[Arg1]
       IMUL    EBX,[Arg2]
       IMUL    ECX,[Arg3]
       IMUL    EDX,[Arg4]
       IMUL    EDI,[Arg5]
       IMUL    ESI,[Arg6]
       MOV     [Res1],EAX
       MOV     [Res2],EBX
       MOV     [Res3],ECX
       MOV     [Res4],EDX
       MOV     [Res5],EDI
       MOV     [Res6],ESI

       MOV     EAX,[Mem1]
       IMUL    EAX,[Arg1]
       MOV     [Res1],EAX
       MOV     EAX,[Mem2]
       IMUL    EAX,[Arg2]
       MOV     [Res2],EAX
       MOV     EAX,[Mem3]
       IMUL    EAX,[Arg3]
       MOV     [Res3],EAX
       MOV     EAX,[Mem4]
       IMUL    EAX,[Arg4]
       MOV     [Res4],EAX
       MOV     EAX,[Mem5]
       IMUL    EAX,[Arg5]
       MOV     [Res5],EAX
       MOV     EAX,[Mem6]
       IMUL    EAX,[Arg6]
       MOV     [Res6],EAX

Regards,

Steve N.

jj2007 · May 05, 2014, 04:51:30 PM

Quote from: FORTRANS on May 05, 2014, 08:15:20 AM
Would the first block of code be "noticeably" faster than the
second?

"Noticeably" indeed, Steve: about one cycle less per pass. It's ten bytes longer, though, so occasionally the cache might make block B the better solution.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

1520 cycles for 100 * block A
1610 cycles for 100 * block B

1513 cycles for 100 * block A
1611 cycles for 100 * block B

1513 cycles for 100 * block A
1614 cycles for 100 * block B

111 bytes for block A
101 bytes for block B
