INC DEC versus ADD SUB 2

nidud · December 30, 2016, 09:40:18 AM

deleted

hutch-- · December 30, 2016, 11:45:35 AM

JJ,

Funny results on my Haswell.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

?? cycles for 1000 * dec
?? cycles for 1000 * sub

?? cycles for 1000 * dec
?? cycles for 1000 * sub

?? cycles for 1000 * dec
?? cycles for 1000 * sub

?? cycles for 1000 * dec
?? cycles for 1000 * sub

?? cycles for 1000 * dec
?? cycles for 1000 * sub

0 bytes for dec
2 bytes for sub

--- ok ---

jj2007 · December 30, 2016, 07:42:29 PM

Quote from: hutch-- on December 30, 2016, 11:45:35 AM
Funny results on my Haswell.

Try this one.

Code:

NameA equ dec	; assign a descriptive name here
TestA proc
  mov ebx, AlgoLoops-1	; loop e.g. 100x
  align 4
  .Repeat
	dec eax
	dec ecx
	dec edx
	dec eax
	dec ecx
	dec edx
	dec eax
	dec ecx
	dec edx
	dec ebx
  .Until Sign?
  ret
TestA endp
TestA_endp:

align_64
TestB_s:
NameB equ sub	; assign a descriptive name here
TestB proc
  mov ebx, AlgoLoops-1	; loop e.g. 100x
  align 4
  .Repeat
	sub eax, 1
	sub ecx, 1
	sub edx, 1
	sub eax, 1
	sub ecx, 1
	sub edx, 1
	sub eax, 1
	sub ecx, 1
	sub edx, 1
	sub ebx, 1
  .Until Sign?
  ret
TestB endp
TestB_endp:

TWell · December 30, 2016, 08:39:14 PM

Code:

AMD Athlon(tm) II X2 220 Processor (SSE3)

2030    cycles for 1000 * dec
2028    cycles for 1000 * sub

2129    cycles for 1000 * dec
2034    cycles for 1000 * sub

2031    cycles for 1000 * dec
2027    cycles for 1000 * sub

2038    cycles for 1000 * dec
2026    cycles for 1000 * sub

2030    cycles for 1000 * dec
2028    cycles for 1000 * sub

9       bytes for dec
29      bytes for sub

hutch-- · December 30, 2016, 08:53:46 PM

The later version at least produces numbers.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

2368 cycles for 1000 * dec
2351 cycles for 1000 * sub

2359 cycles for 1000 * dec
2346 cycles for 1000 * sub

2355 cycles for 1000 * dec
2348 cycles for 1000 * sub

2357 cycles for 1000 * dec
2354 cycles for 1000 * sub

2361 cycles for 1000 * dec
2349 cycles for 1000 * sub

9 bytes for dec
29 bytes for sub

--- ok ---

There is an unusual effect with this Haswell: when it is idling, the processor drops its frequency from 3.3 GHz to about 1.2 GHz, so when I time anything I run a load first to bring it up to speed, then time the algo.
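A minimal sketch of such a warm-up, assuming the MASM32 SDK: instead of a fixed iteration count it keeps the core busy until a time budget has passed. `WARMUP_MS`, the inner loop count and the label `warm_again` are illustrative choices, not taken from any testbed in this thread.

```masm
include \masm32\include\masm32rt.inc

WARMUP_MS equ 500               ; hypothetical budget: half a second of load

.code
start:
    invoke GetTickCount
    mov esi, eax                ; esi = tick count when the warm-up began

warm_again:
    mov ecx, 10000000           ; a slice of busy work between clock reads
@@:
    sub ecx, 1
    jnz @B

    invoke GetTickCount
    sub eax, esi                ; eax = milliseconds spent so far
    cmp eax, WARMUP_MS
    jb warm_again               ; repeat until the budget is used up

    print "warm-up done, timed code would start here",13,10
    inkey
    exit

end start
```

A time-based loop gives roughly the same warm-up on fast and slow machines, whereas a fixed count takes longer on a slow one. GetTickCount is coarse, typically 10 to 16 ms per tick, which is good enough here.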

Now back to ADD SUB versus INC DEC. Intel say that ADD SUB is faster because INC DEC do not write all the flags (they leave the carry flag alone), which creates a dependency on whatever set the flags earlier. So while the simplified timing produces similar results, the real test is in an optimised algo where there is an interaction between instruction choices. It has been this way since the end of the 386, with pipelines, out-of-order execution and so on. Strangely enough, Intel usually do know what they are talking about.

None of this matters with AMD hardware, which is internally different from Intel processors.
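The flag behaviour behind Intel's advice is easy to check: DEC leaves the carry flag untouched, while SUB rewrites it along with the other arithmetic flags. The following is a small sketch for the MASM32 SDK, not part of any testbed posted here. It shows only the architectural difference and says nothing about speed.

```masm
include \masm32\include\masm32rt.inc

.code
start:
    stc                         ; CF = 1
    mov eax, 5                  ; MOV does not touch the flags
    dec eax                     ; writes OF, SF, ZF, AF, PF; CF keeps its old value
    setc bl
    movzx esi, bl               ; esi = 1: the carry survived the DEC

    stc                         ; CF = 1 again
    mov eax, 5
    sub eax, 1                  ; writes all six flags; no borrow, so CF = 0
    setc bl
    movzx edi, bl               ; edi = 0: SUB replaced the carry

    print "CF after dec: "
    print str$(esi),13,10
    print "CF after sub: "
    print str$(edi),13,10
    inkey
    exit

end start
```

Because DEC has to merge its result with the old carry, a CPU that tracks the flags as one unit must wait for the earlier flag writer. That is the dependency the advice is about. Whether it costs anything on a given processor is exactly what the timings in this thread try to measure.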

jj2007 · December 30, 2016, 10:03:51 PM

Quote from: hutch-- on December 30, 2016, 08:53:46 PM
The later version at least produces numbers.
...
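Intel say that ADD SUB is faster because INC DEC do not write all the flags (they leave the carry flag alone), which creates a dependency on whatever set the flags earlier. So while the simplified timing produces similar results, the real test is in an optimised algo where there is an interaction between instruction choices.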

I've been waiting some years now for a real world example of this mysterious behaviour. In the meantime, as long as this example doesn't pop up, I will continue to use the instruction cache friendly inc and dec instructions. Btw, please shove this intermezzo into the Lab - it is unfair to Habran & Johnsa to hijack their thread. My fault 8)

Code:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2963    cycles for 1000 * dec
2954    cycles for 1000 * sub

2962    cycles for 1000 * dec
2956    cycles for 1000 * sub

One more - a desperate attempt to insert all kinds of dependencies:

Code:

  .Repeat
	dec eax
	mov eax, dword ptr somestring
	dec ecx
	mov eax, dword ptr somestring
	dec edx
	je @F
	dec eax
	push eax
	mov dword ptr somestring, eax
	dec ecx
	mov eax, dword ptr somestring
	pop edx
	sar edx, 1
	dec edx
@@:	dec eax
	mov dword ptr somestring, eax
	dec ecx
	mov eax, dword ptr somestring
	dec edx
	dec ebx
  .Until Sign?

The sub reg, 1 version looks identical; it is just much, much longer. And the results are identical on my CPU:

Code:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

12890   cycles for 1000 * dec
12909   cycles for 1000 * sub

12911   cycles for 1000 * dec
12919   cycles for 1000 * sub

12892   cycles for 1000 * dec
12892   cycles for 1000 * sub

12890   cycles for 1000 * dec
12893   cycles for 1000 * sub

12884   cycles for 1000 * dec
12894   cycles for 1000 * sub

45      bytes for dec
65      bytes for sub
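The 20-byte gap in these size figures (45 against 65 here, 9 against 29 in the earlier listing) follows from the encodings: in 32-bit code dec reg is a single byte, while sub reg, 1 takes three, and each loop contains ten of them. A small sketch for the MASM32 SDK that lets the assembler report the sizes. The labels `dec_s`, `dec_e` and `sub_e` are made up for this example.

```masm
include \masm32\include\masm32rt.inc

.code
; Never executed: these two instructions exist only so their labels can be subtracted.
dec_s:
    dec eax                     ; encodes as 48h
dec_e:
    sub eax, 1                  ; encodes as 83h 0E8h 01h
sub_e:

start:
    mov esi, dec_e - dec_s      ; assembler constant: size of dec eax (1)
    mov edi, sub_e - dec_e      ; assembler constant: size of sub eax, 1 (3)

    print "size of dec eax in bytes: "
    print str$(esi),13,10
    print "size of sub eax, 1 in bytes: "
    print str$(edi),13,10
    inkey
    exit

end start
```

In 64-bit code the one-byte form is gone, because opcodes 40h to 4Fh became REX prefixes, so dec eax takes two bytes there and the size advantage shrinks to one byte per instruction.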

My personal conclusion would be "if you still own a Prescott P4, and you need incredibly fast code for this machine, use sub reg, 1 instead of dec reg", after having carefully studied the explanations of Gilgamesz posted December 7, 2016:

Quote: This is stale optimization advice left over from Pentium 4 ... Some modern compilers (including clang-3.8, and Intel's ICC 13) do use inc when optimizing for speed (-O3), not just for size

New version attached. I've even included the example by arafel posted here a bit more than ten years ago, but no luck with my strange Core i5, it just refuses to honour these theories :(

FORTRANS · December 31, 2016, 01:17:46 AM

Hi,

Here is a quick result jj2007.

Code:

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

15825	cycles for 1000 * dec
24592	cycles for 1000 * sub

15828	cycles for 1000 * dec
24590	cycles for 1000 * sub

15839	cycles for 1000 * dec
24587	cycles for 1000 * sub

15821	cycles for 1000 * dec
24583	cycles for 1000 * sub

15832	cycles for 1000 * dec
24581	cycles for 1000 * sub

56	bytes for dec
74	bytes for sub


--- ok ---

Won't run on older processors.

HTH,

Steve N.

nidud · December 31, 2016, 01:37:36 AM

deleted

nidud · December 31, 2016, 01:49:33 AM

deleted

hutch-- · December 31, 2016, 01:59:50 AM

I cannot get a meaningful difference with a test piece like the following.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.686p
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

.code
start:

call main
inkey
exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

LOCAL tc :DWORD

; ----
; load
; ----
mov edx, 4000000000
@@:
sub edx, 1
jnz @B

cpuid ; serialising instruction, not a cache flush (clobbers eax, ebx, ecx, edx)
invoke SleepEx,90,0 ; wait for some timeslices

; ----
; time
; ----
invoke GetTickCount
push eax

mov edx, 4000000000

@@:
REPEAT 3 ; unroll by 4
sub edx, 1
; dec edx
jz @F
ENDM
sub edx, 1
; dec edx
jnz @B
@@:

invoke GetTickCount
pop ecx
sub eax, ecx

print str$(eax)," timing ms",13,10

ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

ragdog · December 31, 2016, 04:29:17 AM

Code:


AMD Athlon(tm) II P360 Dual-Core Processor (SSE3)

6571    cycles for 1000 * dec
5825    cycles for 1000 * sub

5862    cycles for 1000 * dec
5824    cycles for 1000 * sub

5801    cycles for 1000 * dec
5770    cycles for 1000 * sub

5834    cycles for 1000 * dec
5874    cycles for 1000 * sub

5939    cycles for 1000 * dec
5810    cycles for 1000 * sub

9       bytes for dec
29      bytes for sub


--- ok ---

daydreamer · February 16, 2017, 12:25:09 AM

Why not discuss and test new instructions versus old ones, like the SSE float and SSE2 packed-integer instructions? They turn out to be useful when you code parallel instructions.

The MASM Forum
