Which is "better"?

Started by CCurl, October 15, 2015, 01:19:08 PM

CCurl

Is one "better" than the other, and why?
I understand that register usage is more efficient, but the 2nd approach uses more instructions, so is it really that much better (faster)?


lea  esi, retStack
mov  DWORD PTR [esi], 0

or

xor  eax, eax
lea  esi, retStack
mov  [esi], eax


dedndave

first, use OFFSET when you can
    mov     esi,offset retStack  ;if global variable
if it's a local variable, LEA is the right way
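
To make the OFFSET/LEA distinction concrete, here is a minimal MASM sketch. Only retStack comes from the question; the procedure name ClearBoth and the local array localStack are made up for illustration, and the array sizes are arbitrary.

```asm
.386
.model flat, stdcall
option casemap:none

.data?
retStack dd 64 dup(?)            ; global: its address is a link-time constant

.code
ClearBoth proc uses esi
    LOCAL localStack[64]:DWORD   ; local: lives on the stack, addressed relative to ebp

    mov  esi, offset retStack    ; address is an immediate, nothing to compute at run time
    mov  dword ptr [esi], 0

    lea  esi, localStack         ; address = ebp minus a displacement, so LEA has to compute it
    mov  dword ptr [esi], 0
    ret
ClearBoth endp
end
```

For the global, mov esi, offset retStack is 5 bytes, while lea esi, retStack would be 6, which is one reason to prefer OFFSET where it applies.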

as for 0 or a reg value, it kind of depends - lol

    mov dword ptr [esi],0

by nature, this is the fastest one, but it's also the largest one (the immediate 0 operand is 4 bytes)
inside a loop, the smallest code may sometimes be the faster choice

and - it depends because sometimes, you have a register that is 0 handy - sometimes you don't

sinsi

Half the size of mov, possibly as fast?
and dword ptr [esi], 0
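
The size difference comes from how the immediate is encoded: a mov to a dword in memory only exists with a full 32-bit immediate, while and has a form that takes a sign-extended 8-bit immediate. A sketch with the standard IA-32 encodings in the comments (fragment only, esi is assumed to point at the variable already):

```asm
    mov  dword ptr [esi], 0      ; C7 06 00 00 00 00   6 bytes (imm32)
    and  dword ptr [esi], 0      ; 83 26 00            3 bytes (imm8, sign-extended)

    xor  eax, eax                ; 33 C0 or 31 C0      2 bytes, assembler's choice
    mov  [esi], eax              ; 89 06               2 bytes
```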

jj2007

I was about to write "speed-wise, they are all the same", but then I decided to test it - massive surprise:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

30      cycles for 100 * mov rstack, 0
588     cycles for 100 * and rstack, 0
28      cycles for 100 * mov dword ptr [esi], 0
573     cycles for 100 * and dword ptr [esi], 0
50      cycles for 100 * mov [esi], eax

27      cycles for 100 * mov rstack, 0
583     cycles for 100 * and rstack, 0
26      cycles for 100 * mov dword ptr [esi], 0
571     cycles for 100 * and dword ptr [esi], 0
50      cycles for 100 * mov [esi], eax

10      bytes for mov rstack, 0
7       bytes for and rstack, 0
11      bytes for mov dword ptr [esi], 0
8       bytes for and dword ptr [esi], 0
9       bytes for mov [esi], eax


So apparently, there is a stall problem. Let's see some other CPUs...
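
One likely explanation, offered as an assumption rather than something measured in this thread: and with a memory destination is a read-modify-write, so when it is repeated on the same address every and has to load the value the previous one stored. A plain mov store reads nothing, so consecutive stores do not wait for each other. Assuming the test repeats each instruction 100 times back to back, as the output labels suggest, the two cases look like this:

```asm
    mov  esi, offset rstack      ; rstack: the dword being cleared

    ; plain stores: no load involved, each store is independent of the one before
    REPEAT 100
        mov  dword ptr [esi], 0
    ENDM

    ; read-modify-write: each AND loads the dword the previous AND just stored,
    ; so the 100 instructions form one long dependency chain through memory
    REPEAT 100
        and  dword ptr [esi], 0
    ENDM
```

If that reading is right, a single and dword ptr [esi], 0 in ordinary code would not show the same penalty; the chain only appears when the same location is hit repeatedly.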

hutch--

Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)

??      cycles for 100 * mov rstack, 0
355     cycles for 100 * and rstack, 0
??      cycles for 100 * mov dword ptr [esi], 0
355     cycles for 100 * and dword ptr [esi], 0
1       cycles for 100 * mov [esi], eax

??      cycles for 100 * mov rstack, 0
355     cycles for 100 * and rstack, 0
??      cycles for 100 * mov dword ptr [esi], 0
356     cycles for 100 * and dword ptr [esi], 0
??      cycles for 100 * mov [esi], eax

??      cycles for 100 * mov rstack, 0
355     cycles for 100 * and rstack, 0
??      cycles for 100 * mov dword ptr [esi], 0
355     cycles for 100 * and dword ptr [esi], 0
??      cycles for 100 * mov [esi], eax

??      cycles for 100 * mov rstack, 0
355     cycles for 100 * and rstack, 0
??      cycles for 100 * mov dword ptr [esi], 0
354     cycles for 100 * and dword ptr [esi], 0
??      cycles for 100 * mov [esi], eax

??      cycles for 100 * mov rstack, 0
355     cycles for 100 * and rstack, 0
??      cycles for 100 * mov dword ptr [esi], 0
355     cycles for 100 * and dword ptr [esi], 0
??      cycles for 100 * mov [esi], eax

10      bytes for mov rstack, 0
7       bytes for and rstack, 0
11      bytes for mov dword ptr [esi], 0
8       bytes for and dword ptr [esi], 0
9       bytes for mov [esi], eax


--- ok ---


jj2007

Hutch,

?? means negative results, may disappear when testing again. Your i7 is too fast ;-)
You may try increasing the counter in the SpinUp macro: mov eax, 200000000

In any case, it's strange that we never stumbled over this one. Incredibly slow, but I suspect this is specific to Core... voilà, the Celeron behaves differently:
Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz (SSE4)

26      cycles for 100 * mov rstack, 0
315     cycles for 100 * and rstack, 0
83      cycles for 100 * mov dword ptr [esi], 0
85      cycles for 100 * and dword ptr [esi], 0
99      cycles for 100 * mov [esi], eax
328     cycles for 100 * mov dword ptr [esi], eax (preserved)

26      cycles for 100 * mov rstack, 0
315     cycles for 100 * and rstack, 0
83      cycles for 100 * mov dword ptr [esi], 0
85      cycles for 100 * and dword ptr [esi], 0
97      cycles for 100 * mov [esi], eax
331     cycles for 100 * mov dword ptr [esi], eax (preserved)

26      cycles for 100 * mov rstack, 0
315     cycles for 100 * and rstack, 0
83      cycles for 100 * mov dword ptr [esi], 0
85      cycles for 100 * and dword ptr [esi], 0
99      cycles for 100 * mov [esi], eax
330     cycles for 100 * mov dword ptr [esi], eax (preserved)

26      cycles for 100 * mov rstack, 0
315     cycles for 100 * and rstack, 0
83      cycles for 100 * mov dword ptr [esi], 0
85      cycles for 100 * and dword ptr [esi], 0
99      cycles for 100 * mov [esi], eax
331     cycles for 100 * mov dword ptr [esi], eax (preserved)

26      cycles for 100 * mov rstack, 0
316     cycles for 100 * and rstack, 0
84      cycles for 100 * mov dword ptr [esi], 0
85      cycles for 100 * and dword ptr [esi], 0
99      cycles for 100 * mov [esi], eax
330     cycles for 100 * mov dword ptr [esi], eax (preserved)

10      bytes for mov rstack, 0
8       bytes for and rstack, 0
11      bytes for mov dword ptr [esi], 0
8       bytes for and dword ptr [esi], 0
9       bytes for mov [esi], eax
11      bytes for mov dword ptr [esi], eax (preserved)

TWell

AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)

8       cycles for 100 * mov rstack, 0
387     cycles for 100 * and rstack, 0
6       cycles for 100 * mov dword ptr [esi], 0
412     cycles for 100 * and dword ptr [esi], 0
101     cycles for 100 * mov [esi], eax

11      cycles for 100 * mov rstack, 0
391     cycles for 100 * and rstack, 0
15      cycles for 100 * mov dword ptr [esi], 0
393     cycles for 100 * and dword ptr [esi], 0
94      cycles for 100 * mov [esi], eax

4       cycles for 100 * mov rstack, 0
385     cycles for 100 * and rstack, 0
13      cycles for 100 * mov dword ptr [esi], 0
386     cycles for 100 * and dword ptr [esi], 0
93      cycles for 100 * mov [esi], eax

6       cycles for 100 * mov rstack, 0
391     cycles for 100 * and rstack, 0
10      cycles for 100 * mov dword ptr [esi], 0
386     cycles for 100 * and dword ptr [esi], 0
100     cycles for 100 * mov [esi], eax

4       cycles for 100 * mov rstack, 0
383     cycles for 100 * and rstack, 0
7       cycles for 100 * mov dword ptr [esi], 0
386     cycles for 100 * and dword ptr [esi], 0
92      cycles for 100 * mov [esi], eax

10      bytes for mov rstack, 0
7       bytes for and rstack, 0
11      bytes for mov dword ptr [esi], 0
8       bytes for and dword ptr [esi], 0
9       bytes for mov [esi], eax

Mikl__

Intel(R) Pentium(R) CPU G860 @ 3.00GHz (SSE4)

8 cycles for 100 * mov rstack, 0
590 cycles for 100 * and rstack, 0
20 cycles for 100 * mov dword ptr [esi], 0
587 cycles for 100 * and dword ptr [esi], 0
68 cycles for 100 * mov [esi], eax

20 cycles for 100 * mov rstack, 0
591 cycles for 100 * and rstack, 0
20 cycles for 100 * mov dword ptr [esi], 0
565 cycles for 100 * and dword ptr [esi], 0
68 cycles for 100 * mov [esi], eax

18 cycles for 100 * mov rstack, 0
593 cycles for 100 * and rstack, 0
20 cycles for 100 * mov dword ptr [esi], 0
562 cycles for 100 * and dword ptr [esi], 0
68 cycles for 100 * mov [esi], eax

14 cycles for 100 * mov rstack, 0
589 cycles for 100 * and rstack, 0
20 cycles for 100 * mov dword ptr [esi], 0
562 cycles for 100 * and dword ptr [esi], 0
69 cycles for 100 * mov [esi], eax

19 cycles for 100 * mov rstack, 0
586 cycles for 100 * and rstack, 0
19 cycles for 100 * mov dword ptr [esi], 0
565 cycles for 100 * and dword ptr [esi], 0
67 cycles for 100 * mov [esi], eax

10 bytes for mov rstack, 0
7 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax

jj2007

Thanks

I found a compromise for and dword ptr [esi]: push 0 followed by pop dword ptr [esi]. The pair takes about 0.7 cycles instead of about 5.7 for the and:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

30      cycles for 100 * mov rstack, 0
584     cycles for 100 * and rstack, 0
29      cycles for 100 * mov dword ptr [esi], 0
570     cycles for 100 * and dword ptr [esi], 0
49      cycles for 100 * mov [esi], eax (trashed)
410     cycles for 100 * mov dword ptr [esi], eax (preserved)
69      cycles for 100 * push 0/pop dword ptr [esi]

30      cycles for 100 * mov rstack, 0
585     cycles for 100 * and rstack, 0
27      cycles for 100 * mov dword ptr [esi], 0
570     cycles for 100 * and dword ptr [esi], 0
51      cycles for 100 * mov [esi], eax (trashed)
412     cycles for 100 * mov dword ptr [esi], eax (preserved)
71      cycles for 100 * push 0/pop dword ptr [esi]

31      cycles for 100 * mov rstack, 0
584     cycles for 100 * and rstack, 0
29      cycles for 100 * mov dword ptr [esi], 0
569     cycles for 100 * and dword ptr [esi], 0
49      cycles for 100 * mov [esi], eax (trashed)
410     cycles for 100 * mov dword ptr [esi], eax (preserved)
68      cycles for 100 * push 0/pop dword ptr [esi]

10      bytes for mov rstack, 0
8       bytes for and rstack, 0
11      bytes for mov dword ptr [esi], 0
8       bytes for and dword ptr [esi], 0
9       bytes for mov [esi], eax (trashed)
11      bytes for mov dword ptr [esi], eax (preserved)
9       bytes for push 0/pop dword ptr [esi]


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
86      cycles for 100 * mov rstack, 0
979     cycles for 100 * and rstack, 0
98      cycles for 100 * mov dword ptr [esi], 0
979     cycles for 100 * and dword ptr [esi], 0
98      cycles for 100 * mov [esi], eax (trashed)
381     cycles for 100 * mov dword ptr [esi], eax (preserved)
248     cycles for 100 * push 0/pop dword ptr [esi]

86      cycles for 100 * mov rstack, 0
979     cycles for 100 * and rstack, 0
98      cycles for 100 * mov dword ptr [esi], 0
979     cycles for 100 * and dword ptr [esi], 0
96      cycles for 100 * mov [esi], eax (trashed)
382     cycles for 100 * mov dword ptr [esi], eax (preserved)
248     cycles for 100 * push 0/pop dword ptr [esi]

86      cycles for 100 * mov rstack, 0
979     cycles for 100 * and rstack, 0
98      cycles for 100 * mov dword ptr [esi], 0
979     cycles for 100 * and dword ptr [esi], 0
96      cycles for 100 * mov [esi], eax (trashed)
383     cycles for 100 * mov dword ptr [esi], eax (preserved)
248     cycles for 100 * push 0/pop dword ptr [esi]

10      bytes for mov rstack, 0
8       bytes for and rstack, 0
11      bytes for mov dword ptr [esi], 0
8       bytes for and dword ptr [esi], 0
9       bytes for mov [esi], eax (trashed)
11      bytes for mov dword ptr [esi], eax (preserved)
9       bytes for push 0/pop dword ptr [esi]
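
For reference, a sketch of the push/pop variant with the standard IA-32 encodings in the comments (fragment only, esi is assumed to point at the variable). The pop writes the destination without reading it first, which is presumably why it avoids the penalty seen with and:

```asm
    push 0                       ; 6A 00   2 bytes, pushes a sign-extended imm8
    pop  dword ptr [esi]         ; 8F 06   2 bytes, stores the popped dword at [esi]
```

That is 4 bytes for the pair, against 6 for mov dword ptr [esi], 0 and 3 for and dword ptr [esi], 0, and no register is needed to hold the zero.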

FORTRANS

{P=III}
pre-P4 (SSE1)

100 cycles for 100 * mov rstack, 0
1003 cycles for 100 * and rstack, 0
100 cycles for 100 * mov dword ptr [esi], 0
993 cycles for 100 * and dword ptr [esi], 0
103 cycles for 100 * mov [esi], eax (trashed)
205 cycles for 100 * mov dword ptr [esi], eax (preserved)
406 cycles for 100 * push 0/pop dword ptr [esi]

100 cycles for 100 * mov rstack, 0
999 cycles for 100 * and rstack, 0
100 cycles for 100 * mov dword ptr [esi], 0
994 cycles for 100 * and dword ptr [esi], 0
111 cycles for 100 * mov [esi], eax (trashed)
204 cycles for 100 * mov dword ptr [esi], eax (preserved)
406 cycles for 100 * push 0/pop dword ptr [esi]

100 cycles for 100 * mov rstack, 0
1000 cycles for 100 * and rstack, 0
102 cycles for 100 * mov dword ptr [esi], 0
994 cycles for 100 * and dword ptr [esi], 0
103 cycles for 100 * mov [esi], eax (trashed)
204 cycles for 100 * mov dword ptr [esi], eax (preserved)
407 cycles for 100 * push 0/pop dword ptr [esi]

10 bytes for mov rstack, 0
8 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax (trashed)
11 bytes for mov dword ptr [esi], eax (preserved)
9 bytes for push 0/pop dword ptr [esi]


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

87 cycles for 100 * mov rstack, 0
991 cycles for 100 * and rstack, 0
101 cycles for 100 * mov dword ptr [esi], 0
988 cycles for 100 * and dword ptr [esi], 0
98 cycles for 100 * mov [esi], eax (trashed)
394 cycles for 100 * mov dword ptr [esi], eax (preserved)
250 cycles for 100 * push 0/pop dword ptr [esi]

89 cycles for 100 * mov rstack, 0
992 cycles for 100 * and rstack, 0
102 cycles for 100 * mov dword ptr [esi], 0
992 cycles for 100 * and dword ptr [esi], 0
98 cycles for 100 * mov [esi], eax (trashed)
395 cycles for 100 * mov dword ptr [esi], eax (preserved)
253 cycles for 100 * push 0/pop dword ptr [esi]

88 cycles for 100 * mov rstack, 0
991 cycles for 100 * and rstack, 0
102 cycles for 100 * mov dword ptr [esi], 0
995 cycles for 100 * and dword ptr [esi], 0
97 cycles for 100 * mov [esi], eax (trashed)
395 cycles for 100 * mov dword ptr [esi], eax (preserved)
251 cycles for 100 * push 0/pop dword ptr [esi]

10 bytes for mov rstack, 0
8 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax (trashed)
11 bytes for mov dword ptr [esi], eax (preserved)
9 bytes for push 0/pop dword ptr [esi]


--- ok ---

jj2007

Thanks, Steve et al.

So the conclusion for CCurl: speed-wise, these are the best options:
  xor eax, eax
  mov rstack, eax

  mov rstack, 0

  mov esi, offset rstack
  mov dword ptr [esi], 0

  mov esi, offset rstack
  xor eax, eax
  mov dword ptr [esi], eax

CCurl

You folks are very thorough ... thanks a lot!

raymond

I understand the necessity of using the size specifier "dword ptr" in instructions such as "mov dword ptr [esi],0" where the size to be moved is not evident.

However, I do not understand its use in instructions such as "mov dword ptr [esi],eax". According to the MASM syntax, "mov [esi],eax" would be sufficient (unless you really enjoy the extra typing) because the size is already specified by the size of the register. Are other assemblers more fussy???
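
A short MASM sketch of the point about the size override (fragment only, esi is assumed to hold a valid address). The commented-out line is there to show the one case where the override is actually needed:

```asm
    mov  [esi], eax              ; size is taken from eax, encodes as 89 06
    mov  dword ptr [esi], eax    ; identical encoding, the override is redundant here
    mov  dword ptr [esi], 0      ; override required: neither operand implies a size
;   mov  [esi], 0                ; rejected by MASM, the operand size cannot be determined
```
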