Is one "better" than the other, and why?
I understand that register usage is more efficient, but the 2nd approach uses more instructions, so is it really that much better (faster)?
lea esi, retStack
mov DWORD PTR [esi], 0
or
xor eax, eax
lea esi, retStack
mov [esi], eax
first, use OFFSET when you can
mov esi,offset retStack ;if global variable
if it's a local variable, LEA is the right way
as for 0 or a reg value, it kind of depends - lol
mov dword ptr [esi],0
by nature, this is the fastest one, but it's also the largest one (the immediate 0 operand is 4 bytes)
inside a loop, the smallest code may sometimes be the faster choice
and - it depends because sometimes, you have a register that is 0 handy - sometimes you don't
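To make the OFFSET vs. LEA point concrete, here is a minimal MASM sketch. The procedure names ClearGlobal and ClearLocal and the local array locStack are made up for illustration; only retStack comes from the question, and its size of 16 dwords is an assumption.

```asm
.386
.model flat, stdcall
option casemap:none

.data?
retStack dd 16 dup(?)           ; global; the size of 16 dwords is assumed

.code
ClearGlobal proc uses esi
    mov esi, offset retStack    ; address is a link-time constant, encoded as an immediate
    mov dword ptr [esi], 0      ; zero the first dword
    ret
ClearGlobal endp

ClearLocal proc uses esi
    LOCAL locStack[16]:DWORD    ; hypothetical local array in the stack frame
    lea esi, locStack           ; address is ebp-relative, so it has to be computed at run time
    mov dword ptr [esi], 0      ; zero the first dword
    ret
ClearLocal endp

end
```

For a global, mov esi, offset retStack encodes in 5 bytes (BE + imm32), while lea esi, retStack needs 6 bytes (8D 35 + disp32), which is why OFFSET is the better choice whenever the address is a constant.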
thanks!
Half the size of mov, possibly as fast?
and dword ptr [esi], 0
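For reference, the size difference comes from how the immediate is encoded. The byte values in the comments are the standard IA-32 encodings as far as I know them; the procedure name EncodingDemo is hypothetical, and it is worth checking the assembler's listing file to confirm what your build actually emits.

```asm
.386
.model flat, stdcall
option casemap:none

.code
EncodingDemo proc
    ; esi is assumed to point at a writable dword on entry
    mov dword ptr [esi], 0      ; C7 06 00 00 00 00   6 bytes: C7 /0 always carries a full imm32
    and dword ptr [esi], 0      ; 83 26 00            3 bytes: 83 /4 takes a sign-extended imm8
    xor eax, eax                ; 33 C0 (or 31 C0)    2 bytes to get a zero register
    mov [esi], eax              ; 89 06               2 bytes, but only once eax is zero
    ret
EncodingDemo endp

end
```

Keep in mind that and has to read the old value before it writes the result, whereas mov only writes, so the smaller encoding does not by itself say anything about speed.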
I was about to write "speed-wise, they are all the same", but then I decided to test it - massive surprise:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
30 cycles for 100 * mov rstack, 0
588 cycles for 100 * and rstack, 0
28 cycles for 100 * mov dword ptr [esi], 0
573 cycles for 100 * and dword ptr [esi], 0
50 cycles for 100 * mov [esi], eax
27 cycles for 100 * mov rstack, 0
583 cycles for 100 * and rstack, 0
26 cycles for 100 * mov dword ptr [esi], 0
571 cycles for 100 * and dword ptr [esi], 0
50 cycles for 100 * mov [esi], eax
10 bytes for mov rstack, 0
7 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax
So apparently, there is a stall problem. Let's see some other CPUs...
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (SSE4)
?? cycles for 100 * mov rstack, 0
355 cycles for 100 * and rstack, 0
?? cycles for 100 * mov dword ptr [esi], 0
355 cycles for 100 * and dword ptr [esi], 0
1 cycles for 100 * mov [esi], eax
?? cycles for 100 * mov rstack, 0
355 cycles for 100 * and rstack, 0
?? cycles for 100 * mov dword ptr [esi], 0
356 cycles for 100 * and dword ptr [esi], 0
?? cycles for 100 * mov [esi], eax
?? cycles for 100 * mov rstack, 0
355 cycles for 100 * and rstack, 0
?? cycles for 100 * mov dword ptr [esi], 0
355 cycles for 100 * and dword ptr [esi], 0
?? cycles for 100 * mov [esi], eax
?? cycles for 100 * mov rstack, 0
355 cycles for 100 * and rstack, 0
?? cycles for 100 * mov dword ptr [esi], 0
354 cycles for 100 * and dword ptr [esi], 0
?? cycles for 100 * mov [esi], eax
?? cycles for 100 * mov rstack, 0
355 cycles for 100 * and rstack, 0
?? cycles for 100 * mov dword ptr [esi], 0
355 cycles for 100 * and dword ptr [esi], 0
?? cycles for 100 * mov [esi], eax
10 bytes for mov rstack, 0
7 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax
--- ok ---
Hutch,
?? means negative results, may disappear when testing again. Your i7 is too fast ;-)
You may try increasing the counter in the SpinUp macro: mov eax, 200000000
In any case, it's strange that we never stumbled over this one. Incredibly slow, but I have a suspicion that this is specific to Core... voilà, Celeron behaves differently:
Intel(R) Celeron(R) CPU N2840 @ 2.16GHz (SSE4)
26 cycles for 100 * mov rstack, 0
315 cycles for 100 * and rstack, 0
83 cycles for 100 * mov dword ptr [esi], 0
85 cycles for 100 * and dword ptr [esi], 0
99 cycles for 100 * mov [esi], eax
328 cycles for 100 * mov dword ptr [esi], eax (preserved)
26 cycles for 100 * mov rstack, 0
315 cycles for 100 * and rstack, 0
83 cycles for 100 * mov dword ptr [esi], 0
85 cycles for 100 * and dword ptr [esi], 0
97 cycles for 100 * mov [esi], eax
331 cycles for 100 * mov dword ptr [esi], eax (preserved)
26 cycles for 100 * mov rstack, 0
315 cycles for 100 * and rstack, 0
83 cycles for 100 * mov dword ptr [esi], 0
85 cycles for 100 * and dword ptr [esi], 0
99 cycles for 100 * mov [esi], eax
330 cycles for 100 * mov dword ptr [esi], eax (preserved)
26 cycles for 100 * mov rstack, 0
315 cycles for 100 * and rstack, 0
83 cycles for 100 * mov dword ptr [esi], 0
85 cycles for 100 * and dword ptr [esi], 0
99 cycles for 100 * mov [esi], eax
331 cycles for 100 * mov dword ptr [esi], eax (preserved)
26 cycles for 100 * mov rstack, 0
316 cycles for 100 * and rstack, 0
84 cycles for 100 * mov dword ptr [esi], 0
85 cycles for 100 * and dword ptr [esi], 0
99 cycles for 100 * mov [esi], eax
330 cycles for 100 * mov dword ptr [esi], eax (preserved)
10 bytes for mov rstack, 0
8 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax
11 bytes for mov dword ptr [esi], eax (preserved)
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)
8 cycles for 100 * mov rstack, 0
387 cycles for 100 * and rstack, 0
6 cycles for 100 * mov dword ptr [esi], 0
412 cycles for 100 * and dword ptr [esi], 0
101 cycles for 100 * mov [esi], eax
11 cycles for 100 * mov rstack, 0
391 cycles for 100 * and rstack, 0
15 cycles for 100 * mov dword ptr [esi], 0
393 cycles for 100 * and dword ptr [esi], 0
94 cycles for 100 * mov [esi], eax
4 cycles for 100 * mov rstack, 0
385 cycles for 100 * and rstack, 0
13 cycles for 100 * mov dword ptr [esi], 0
386 cycles for 100 * and dword ptr [esi], 0
93 cycles for 100 * mov [esi], eax
6 cycles for 100 * mov rstack, 0
391 cycles for 100 * and rstack, 0
10 cycles for 100 * mov dword ptr [esi], 0
386 cycles for 100 * and dword ptr [esi], 0
100 cycles for 100 * mov [esi], eax
4 cycles for 100 * mov rstack, 0
383 cycles for 100 * and rstack, 0
7 cycles for 100 * mov dword ptr [esi], 0
386 cycles for 100 * and dword ptr [esi], 0
92 cycles for 100 * mov [esi], eax
10 bytes for mov rstack, 0
7 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax
Intel(R) Pentium(R) CPU G860 @ 3.00GHz (SSE4)
8 cycles for 100 * mov rstack, 0
590 cycles for 100 * and rstack, 0
20 cycles for 100 * mov dword ptr [esi], 0
587 cycles for 100 * and dword ptr [esi], 0
68 cycles for 100 * mov [esi], eax
20 cycles for 100 * mov rstack, 0
591 cycles for 100 * and rstack, 0
20 cycles for 100 * mov dword ptr [esi], 0
565 cycles for 100 * and dword ptr [esi], 0
68 cycles for 100 * mov [esi], eax
18 cycles for 100 * mov rstack, 0
593 cycles for 100 * and rstack, 0
20 cycles for 100 * mov dword ptr [esi], 0
562 cycles for 100 * and dword ptr [esi], 0
68 cycles for 100 * mov [esi], eax
14 cycles for 100 * mov rstack, 0
589 cycles for 100 * and rstack, 0
20 cycles for 100 * mov dword ptr [esi], 0
562 cycles for 100 * and dword ptr [esi], 0
69 cycles for 100 * mov [esi], eax
19 cycles for 100 * mov rstack, 0
586 cycles for 100 * and rstack, 0
19 cycles for 100 * mov dword ptr [esi], 0
565 cycles for 100 * and dword ptr [esi], 0
67 cycles for 100 * mov [esi], eax
10 bytes for mov rstack, 0
7 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax
Thanks
I found a compromise for and dword ptr [esi]: push 0, pop dword ptr [esi]. It takes about 0.7 cycles instead of more than 5:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
30 cycles for 100 * mov rstack, 0
584 cycles for 100 * and rstack, 0
29 cycles for 100 * mov dword ptr [esi], 0
570 cycles for 100 * and dword ptr [esi], 0
49 cycles for 100 * mov [esi], eax (trashed)
410 cycles for 100 * mov dword ptr [esi], eax (preserved)
69 cycles for 100 * push 0/pop dword ptr [esi]
30 cycles for 100 * mov rstack, 0
585 cycles for 100 * and rstack, 0
27 cycles for 100 * mov dword ptr [esi], 0
570 cycles for 100 * and dword ptr [esi], 0
51 cycles for 100 * mov [esi], eax (trashed)
412 cycles for 100 * mov dword ptr [esi], eax (preserved)
71 cycles for 100 * push 0/pop dword ptr [esi]
31 cycles for 100 * mov rstack, 0
584 cycles for 100 * and rstack, 0
29 cycles for 100 * mov dword ptr [esi], 0
569 cycles for 100 * and dword ptr [esi], 0
49 cycles for 100 * mov [esi], eax (trashed)
410 cycles for 100 * mov dword ptr [esi], eax (preserved)
68 cycles for 100 * push 0/pop dword ptr [esi]
10 bytes for mov rstack, 0
8 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax (trashed)
11 bytes for mov dword ptr [esi], eax (preserved)
9 bytes for push 0/pop dword ptr [esi]
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
86 cycles for 100 * mov rstack, 0
979 cycles for 100 * and rstack, 0
98 cycles for 100 * mov dword ptr [esi], 0
979 cycles for 100 * and dword ptr [esi], 0
98 cycles for 100 * mov [esi], eax (trashed)
381 cycles for 100 * mov dword ptr [esi], eax (preserved)
248 cycles for 100 * push 0/pop dword ptr [esi]
86 cycles for 100 * mov rstack, 0
979 cycles for 100 * and rstack, 0
98 cycles for 100 * mov dword ptr [esi], 0
979 cycles for 100 * and dword ptr [esi], 0
96 cycles for 100 * mov [esi], eax (trashed)
382 cycles for 100 * mov dword ptr [esi], eax (preserved)
248 cycles for 100 * push 0/pop dword ptr [esi]
86 cycles for 100 * mov rstack, 0
979 cycles for 100 * and rstack, 0
98 cycles for 100 * mov dword ptr [esi], 0
979 cycles for 100 * and dword ptr [esi], 0
96 cycles for 100 * mov [esi], eax (trashed)
383 cycles for 100 * mov dword ptr [esi], eax (preserved)
248 cycles for 100 * push 0/pop dword ptr [esi]
10 bytes for mov rstack, 0
8 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax (trashed)
11 bytes for mov dword ptr [esi], eax (preserved)
9 bytes for push 0/pop dword ptr [esi]
{P=III}
pre-P4 (SSE1)
100 cycles for 100 * mov rstack, 0
1003 cycles for 100 * and rstack, 0
100 cycles for 100 * mov dword ptr [esi], 0
993 cycles for 100 * and dword ptr [esi], 0
103 cycles for 100 * mov [esi], eax (trashed)
205 cycles for 100 * mov dword ptr [esi], eax (preserved)
406 cycles for 100 * push 0/pop dword ptr [esi]
100 cycles for 100 * mov rstack, 0
999 cycles for 100 * and rstack, 0
100 cycles for 100 * mov dword ptr [esi], 0
994 cycles for 100 * and dword ptr [esi], 0
111 cycles for 100 * mov [esi], eax (trashed)
204 cycles for 100 * mov dword ptr [esi], eax (preserved)
406 cycles for 100 * push 0/pop dword ptr [esi]
100 cycles for 100 * mov rstack, 0
1000 cycles for 100 * and rstack, 0
102 cycles for 100 * mov dword ptr [esi], 0
994 cycles for 100 * and dword ptr [esi], 0
103 cycles for 100 * mov [esi], eax (trashed)
204 cycles for 100 * mov dword ptr [esi], eax (preserved)
407 cycles for 100 * push 0/pop dword ptr [esi]
10 bytes for mov rstack, 0
8 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax (trashed)
11 bytes for mov dword ptr [esi], eax (preserved)
9 bytes for push 0/pop dword ptr [esi]
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
87 cycles for 100 * mov rstack, 0
991 cycles for 100 * and rstack, 0
101 cycles for 100 * mov dword ptr [esi], 0
988 cycles for 100 * and dword ptr [esi], 0
98 cycles for 100 * mov [esi], eax (trashed)
394 cycles for 100 * mov dword ptr [esi], eax (preserved)
250 cycles for 100 * push 0/pop dword ptr [esi]
89 cycles for 100 * mov rstack, 0
992 cycles for 100 * and rstack, 0
102 cycles for 100 * mov dword ptr [esi], 0
992 cycles for 100 * and dword ptr [esi], 0
98 cycles for 100 * mov [esi], eax (trashed)
395 cycles for 100 * mov dword ptr [esi], eax (preserved)
253 cycles for 100 * push 0/pop dword ptr [esi]
88 cycles for 100 * mov rstack, 0
991 cycles for 100 * and rstack, 0
102 cycles for 100 * mov dword ptr [esi], 0
995 cycles for 100 * and dword ptr [esi], 0
97 cycles for 100 * mov [esi], eax (trashed)
395 cycles for 100 * mov dword ptr [esi], eax (preserved)
251 cycles for 100 * push 0/pop dword ptr [esi]
10 bytes for mov rstack, 0
8 bytes for and rstack, 0
11 bytes for mov dword ptr [esi], 0
8 bytes for and dword ptr [esi], 0
9 bytes for mov [esi], eax (trashed)
11 bytes for mov dword ptr [esi], eax (preserved)
9 bytes for push 0/pop dword ptr [esi]
--- ok ---
Thanks, Steve et al.
So the conclusion to CCurl: speedwise, these are the best options:
xor eax, eax
mov rstack, eax

mov rstack, 0

mov esi, offset rstack
mov dword ptr [esi], 0

mov esi, offset rstack
xor eax, eax
mov dword ptr [esi], eax
You folks are very thorough ... thanks a lot!
I understand the necessity of using the size specifier "dword ptr" in instructions such as "mov dword ptr [esi],0" where the size to be moved is not evident.
However, I do not understand its use in instructions such as "mov dword ptr [esi],eax". According to the MASM syntax, "mov [esi],eax" would be sufficient (unless you really enjoy the extra typing) because the size is already specified by the size of the register. Are other assemblers more fussy???
Quote from: raymond on October 17, 2015, 02:51:26 AM
"mov [esi],eax" would be sufficient (unless you really enjoy the extra typing) because the size is already specified by the size of the register. Are other assemblers more fussy???
You are perfectly right, Raymond
Re fussiness, JWasm occasionally is a bit more demanding, and in general, caution with SSE and FPU code is a good idea. movq is also a special case with some versions of ML.
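As a small illustration of when the size specifier is needed in MASM syntax, here is a sketch. The variable buf and the procedure name SizeDemo are hypothetical and only serve the example.

```asm
.386
.model flat, stdcall
option casemap:none

.data?
buf dd ?                        ; hypothetical scratch variable

.code
SizeDemo proc uses esi
    mov esi, offset buf
    mov [esi], eax              ; size implied by eax: 32-bit store
    mov [esi], ax               ; size implied by ax: 16-bit store
    mov [esi], al               ; size implied by al: 8-bit store
    mov dword ptr [esi], 0      ; no register operand, so the size has to be stated
    ; mov [esi], 0              ; rejected: the assembler cannot tell the operand size
    mov buf, 0                  ; fine without ptr: buf was declared as a dword
    ret
SizeDemo endp

end
```

Writing dword ptr in front of a register store is harmless, so it comes down to taste; it only becomes mandatory when neither operand carries a size.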