JJ,
Funny results on my Haswell.
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
?? cycles for 1000 * dec
?? cycles for 1000 * sub
?? cycles for 1000 * dec
?? cycles for 1000 * sub
?? cycles for 1000 * dec
?? cycles for 1000 * sub
?? cycles for 1000 * dec
?? cycles for 1000 * sub
?? cycles for 1000 * dec
?? cycles for 1000 * sub
0 bytes for dec
2 bytes for sub
--- ok ---
Quote from: hutch-- on December 30, 2016, 11:45:35 AM
Funny results on my Haswell.
Try this one.
NameA equ dec ; assign a descriptive name here
TestA proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
dec eax
dec ecx
dec edx
dec eax
dec ecx
dec edx
dec eax
dec ecx
dec edx
dec ebx
.Until Sign?
ret
TestA endp
TestA_endp:
align_64
TestB_s:
NameB equ sub ; assign a descriptive name here
TestB proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
sub eax, 1
sub ecx, 1
sub edx, 1
sub eax, 1
sub ecx, 1
sub edx, 1
sub eax, 1
sub ecx, 1
sub edx, 1
sub ebx, 1
.Until Sign?
ret
TestB endp
TestB_endp:
AMD Athlon(tm) II X2 220 Processor (SSE3)
2030 cycles for 1000 * dec
2028 cycles for 1000 * sub
2129 cycles for 1000 * dec
2034 cycles for 1000 * sub
2031 cycles for 1000 * dec
2027 cycles for 1000 * sub
2038 cycles for 1000 * dec
2026 cycles for 1000 * sub
2030 cycles for 1000 * dec
2028 cycles for 1000 * sub
9 bytes for dec
29 bytes for sub
The later version at least produces numbers. :biggrin:
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
2368 cycles for 1000 * dec
2351 cycles for 1000 * sub
2359 cycles for 1000 * dec
2346 cycles for 1000 * sub
2355 cycles for 1000 * dec
2348 cycles for 1000 * sub
2357 cycles for 1000 * dec
2354 cycles for 1000 * sub
2361 cycles for 1000 * dec
2349 cycles for 1000 * sub
9 bytes for dec
29 bytes for sub
--- ok ---
There is an unusual effect with this Haswell: when it's idling, the processor drops the frequency from 3.3 GHz to about 1.2 GHz, so when I time anything I run a load first to bring it up to speed, then time the algo.
Now back to ADD SUB versus INC DEC. Intel say that ADD SUB is faster because it writes all the flags, whereas INC DEC leave the carry flag untouched and so depend on the previous flag state, so while the simplified timing produces similar results, the real test is in an optimised algo where there is an interaction between instruction choices. It has been this way since the end of the .386 with pipelines, out of order execution and so on. Strangely enough Intel usually do know what they are talking about.
None of this matters with AMD hardware which is internally different to Intel processors.
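The flag difference in question: INC and DEC update every arithmetic flag except CF, while ADD and SUB overwrite all of them. A minimal sketch of the one place where that shows up in the result, a multi-dword add. The procedure name AddBig and its register arguments are made up for illustration and are not part of the attached testbed:

```asm
; hypothetical helper: adds ecx dwords at [esi] to the dwords at [edi]
; assumes ecx > 0 and both buffers hold at least ecx dwords
AddBig proc
    clc                     ; start the chain with carry clear
@@:
    mov eax, [esi]
    adc [edi], eax          ; add, including the carry from the previous dword
    lea esi, [esi+4]        ; lea advances the pointers without touching flags
    lea edi, [edi+4]
    dec ecx                 ; sets ZF for the branch, leaves CF for the next adc
    jnz @B
    ret
AddBig endp
```

Replacing dec ecx with sub ecx, 1 here would overwrite CF and break the carry chain. The price of keeping CF is that the result flags of dec are a mix of old and new bits, which is the dependency the Intel advice is about.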
Quote from: hutch-- on December 30, 2016, 08:53:46 PM
The later version at least produces numbers. :biggrin:
...
Intel say that ADD SUB is faster because it writes all the flags, whereas INC DEC leave the carry flag untouched and so depend on the previous flag state, so while the simplified timing produces similar results, the real test is in an optimised algo where there is an interaction between instruction choices.
I've been waiting some years now for a real world example of this mysterious behaviour. In the meantime, as long as this example doesn't pop up, I will continue to use the instruction cache friendly inc and dec instructions. Btw, please shove this intermezzo into the Lab - it is unfair to Habran & Johnsa to hijack their thread. My fault 8)
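For reference, the size gap comes straight from the 32-bit encodings: dec reg has a one-byte short form, while sub reg, 1 needs an opcode, a ModRM byte and an 8-bit immediate. A minimal sketch, with the bytes in the comments written out by hand from the instruction set reference rather than copied from the attached listing:

```asm
    dec eax         ; 48          1 byte
    sub eax, 1      ; 83 E8 01    3 bytes
    dec ebx         ; 4B          1 byte
    sub ebx, 1      ; 83 EB 01    3 bytes
```

Ten such instructions per loop account for the 20-byte difference in the reports (9 versus 29 bytes). In 64-bit code the one-byte form does not exist, because opcodes 40h to 4Fh became REX prefixes, so dec reg takes two bytes there.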
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2963 cycles for 1000 * dec
2954 cycles for 1000 * sub
2962 cycles for 1000 * dec
2956 cycles for 1000 * sub
One more - a desperate attempt to insert all kinds of dependencies:
.Repeat
dec eax
mov eax, dword ptr somestring
dec ecx
mov eax, dword ptr somestring
dec edx
je @F
dec eax
push eax
mov dword ptr somestring, eax
dec ecx
mov eax, dword ptr somestring
pop edx
sar edx, 1
dec edx
@@: dec eax
mov dword ptr somestring, eax
dec ecx
mov eax, dword ptr somestring
dec edx
dec ebx
.Until Sign?
The sub reg, 1 version looks identical; it is just much, much longer. And the results are identical on my CPU:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
12890 cycles for 1000 * dec
12909 cycles for 1000 * sub
12911 cycles for 1000 * dec
12919 cycles for 1000 * sub
12892 cycles for 1000 * dec
12892 cycles for 1000 * sub
12890 cycles for 1000 * dec
12893 cycles for 1000 * sub
12884 cycles for 1000 * dec
12894 cycles for 1000 * sub
45 bytes for dec
65 bytes for sub
My personal conclusion would be "if you still own a Prescott P4, and you need incredibly fast code for this machine, use sub reg, 1 instead of dec reg", after having carefully studied the explanations of Gilgamesz (http://asktoanswer.com/questions/inc-instruction-vs-add-1-does-it-matter/) posted December 7, 2016:
Quote: This is stale optimization advice left over from Pentium 4 ... Some modern compilers (including clang-3.8, and Intel's ICC 13) do use inc when optimizing for speed (-O3), not just for size
New version attached. I've even included the example by arafel (http://www.masmforum.com/board/index.php?topic=4662.msg34911#msg34911) posted here a bit more than ten years ago, but no luck with my strange Core i5, it just refuses to honour these theories :(
Hi,
Here is a quick result jj2007.
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
15825 cycles for 1000 * dec
24592 cycles for 1000 * sub
15828 cycles for 1000 * dec
24590 cycles for 1000 * sub
15839 cycles for 1000 * dec
24587 cycles for 1000 * sub
15821 cycles for 1000 * dec
24583 cycles for 1000 * sub
15832 cycles for 1000 * dec
24581 cycles for 1000 * sub
56 bytes for dec
74 bytes for sub
--- ok ---
Won't run on older processors.
HTH,
Steve N.
I cannot get a meaningful difference with a test piece like the following.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.686p
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
.code
start:
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL tc :DWORD
; ----
; load
; ----
mov edx, 4000000000
@@:
sub edx, 1
jnz @B
cpuid ; serializing instruction (clobbers eax, ebx, ecx, edx)
invoke SleepEx,90,0 ; wait for some timeslices
; ----
; time
; ----
invoke GetTickCount
push eax
mov edx, 4000000000
@@:
REPEAT 3 ; unroll by 4
sub edx, 1
; dec edx
jz @F
ENDM
sub edx, 1
; dec edx
jnz @B
@@:
invoke GetTickCount
pop ecx
sub eax, ecx
print str$(eax)," timing ms",13,10
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
AMD Athlon(tm) II P360 Dual-Core Processor (SSE3)
6571 cycles for 1000 * dec
5825 cycles for 1000 * sub
5862 cycles for 1000 * dec
5824 cycles for 1000 * sub
5801 cycles for 1000 * dec
5770 cycles for 1000 * sub
5834 cycles for 1000 * dec
5874 cycles for 1000 * sub
5939 cycles for 1000 * dec
5810 cycles for 1000 * sub
9 bytes for dec
29 bytes for sub
--- ok ---
Why not discuss testing new instructions versus old ones, like the SSE float and SSE2 packed integer instructions?
Those turn out to be useful when coding parallel operations.