INC DEC versus ADD SUB 2

Started by nidud, December 30, 2016, 09:40:18 AM

nidud

deleted

hutch--

JJ,

Funny results on my Haswell.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

??      cycles for 1000 * dec
??      cycles for 1000 * sub

??      cycles for 1000 * dec
??      cycles for 1000 * sub

??      cycles for 1000 * dec
??      cycles for 1000 * sub

??      cycles for 1000 * dec
??      cycles for 1000 * sub

??      cycles for 1000 * dec
??      cycles for 1000 * sub

0       bytes for dec
2       bytes for sub


--- ok ---

jj2007

Quote from: hutch-- on December 30, 2016, 11:45:35 AM
Funny results on my Haswell.

Try this one.
NameA equ dec ; assign a descriptive name here
TestA proc
  mov ebx, AlgoLoops-1 ; loop e.g. 100x
  align 4
  .Repeat
dec eax
dec ecx
dec edx
dec eax
dec ecx
dec edx
dec eax
dec ecx
dec edx
dec ebx
  .Until Sign?
  ret
TestA endp
TestA_endp:

align_64
TestB_s:
NameB equ sub ; assign a descriptive name here
TestB proc
  mov ebx, AlgoLoops-1 ; loop e.g. 100x
  align 4
  .Repeat
sub eax, 1
sub ecx, 1
sub edx, 1
sub eax, 1
sub ecx, 1
sub edx, 1
sub eax, 1
sub ecx, 1
sub edx, 1
sub ebx, 1
  .Until Sign?
  ret
TestB endp
TestB_endp:

TWell

AMD Athlon(tm) II X2 220 Processor (SSE3)

2030    cycles for 1000 * dec
2028    cycles for 1000 * sub

2129    cycles for 1000 * dec
2034    cycles for 1000 * sub

2031    cycles for 1000 * dec
2027    cycles for 1000 * sub

2038    cycles for 1000 * dec
2026    cycles for 1000 * sub

2030    cycles for 1000 * dec
2028    cycles for 1000 * sub

9       bytes for dec
29      bytes for sub

hutch--

The later version at least produces numbers.  :biggrin:

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

2368    cycles for 1000 * dec
2351    cycles for 1000 * sub

2359    cycles for 1000 * dec
2346    cycles for 1000 * sub

2355    cycles for 1000 * dec
2348    cycles for 1000 * sub

2357    cycles for 1000 * dec
2354    cycles for 1000 * sub

2361    cycles for 1000 * dec
2349    cycles for 1000 * sub

9       bytes for dec
29      bytes for sub


--- ok ---


There is an unusual effect with this Haswell: when it's idling, the processor drops its frequency from 3.3 GHz to about 1.2 GHz, so when I time anything I run a load first to bring it up to speed, then time the algo.

Now back to ADD SUB versus INC DEC. Intel say that ADD SUB is faster because it writes all of the arithmetic flags, whereas INC DEC leaves the carry flag untouched and so depends on the previous flag state. So while the simplified timing produces similar results, the real test is in an optimised algo where there is an interaction between instruction choices. It has been this way since the end of the .386, with pipelines, out-of-order execution and so on. Strangely enough, Intel usually do know what they are talking about.

None of this matters with AMD hardware which is internally different to Intel processors.
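
A minimal sketch, not taken from this thread, of one place where the flag difference is directly observable: a multi-dword addition that carries from one dword to the next. The data names numA and numB are made up for the example. It relies only on the documented behaviour that dec leaves the carry flag alone while sub overwrites it.

```masm
    include \masm32\include\masm32rt.inc

    .data
    numA dd 0FFFFFFFFh, 0FFFFFFFFh, 0   ; hypothetical 96-bit value, low dword first
    numB dd 1, 0, 0                     ; hypothetical 96-bit value to add

    .code
start:
    mov esi, offset numB
    mov edi, offset numA
    mov ecx, 3                  ; number of dwords to add
    clc                         ; no carry into the lowest dword
  @@:
    mov eax, [esi]
    adc [edi], eax              ; consumes CF and produces the next CF
    lea esi, [esi+4]            ; lea leaves all flags alone
    lea edi, [edi+4]
    dec ecx                     ; sets ZF for the branch but preserves CF
    jnz @B

    mov eax, numA+8             ; high dword, 1 if the carry rippled through
    print str$(eax)," in the high dword",13,10
    inkey
    exit

end start
```

Replacing dec ecx with sub ecx, 1 here would clear the carry before the next adc and give a wrong sum, so in this kind of loop the two forms are not interchangeable regardless of speed.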

jj2007

Quote from: hutch-- on December 30, 2016, 08:53:46 PM
The later version at least produces numbers.  :biggrin:
...
Intel say that ADD SUB is faster because it writes all of the arithmetic flags, whereas INC DEC leaves the carry flag untouched and so depends on the previous flag state. So while the simplified timing produces similar results, the real test is in an optimised algo where there is an interaction between instruction choices.

I've been waiting some years now for a real world example of this mysterious behaviour. In the meantime, as long as this example doesn't pop up, I will continue to use the instruction cache friendly inc and dec instructions. Btw, please shove this intermezzo into the Lab - it is unfair to Habran & Johnsa to hijack their thread. My fault 8)
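
As a rough illustration of the size argument (a sketch that is not part of the test bed used in this thread; the labels DecBlock and SubBlock are made up for the example), the following measures both encodings in 32-bit code, where dec reg assembles to one byte and sub reg, 1 to three:

```masm
    include \masm32\include\masm32rt.inc

    .code

; These two blocks are never executed, only measured.
DecBlock:
    dec eax                     ; 48
    dec ecx                     ; 49
    dec edx                     ; 4A
    dec ebx                     ; 4B
DecBlockEnd:

SubBlock:
    sub eax, 1                  ; 83 E8 01
    sub ecx, 1                  ; 83 E9 01
    sub edx, 1                  ; 83 EA 01
    sub ebx, 1                  ; 83 EB 01
SubBlockEnd:

start:
    mov eax, offset DecBlockEnd - offset DecBlock
    print str$(eax)," bytes for 4 * dec",13,10
    mov eax, offset SubBlockEnd - offset SubBlock
    print str$(eax)," bytes for 4 * sub",13,10
    inkey
    exit

end start
```

At two extra bytes per instruction, the ten instructions in each test loop account for the 20-byte gap between the 9 and 29 bytes reported earlier in the thread.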

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2963    cycles for 1000 * dec
2954    cycles for 1000 * sub

2962    cycles for 1000 * dec
2956    cycles for 1000 * sub


One more - a desperate attempt to insert all kinds of dependencies:
  .Repeat
dec eax
mov eax, dword ptr somestring
dec ecx
mov eax, dword ptr somestring
dec edx
je @F
dec eax
push eax
mov dword ptr somestring, eax
dec ecx
mov eax, dword ptr somestring
pop edx
sar edx, 1
dec edx
@@: dec eax
mov dword ptr somestring, eax
dec ecx
mov eax, dword ptr somestring
dec edx
dec ebx
  .Until Sign?


The sub reg, 1 version looks identical; it is just much, much longer. And the results are identical on my CPU:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

12890   cycles for 1000 * dec
12909   cycles for 1000 * sub

12911   cycles for 1000 * dec
12919   cycles for 1000 * sub

12892   cycles for 1000 * dec
12892   cycles for 1000 * sub

12890   cycles for 1000 * dec
12893   cycles for 1000 * sub

12884   cycles for 1000 * dec
12894   cycles for 1000 * sub

45      bytes for dec
65      bytes for sub


My personal conclusion would be "if you still own a Prescott P4, and you need incredibly fast code for this machine, use sub reg, 1 instead of dec reg", after having carefully studied the explanations of Gilgamesz posted December 7, 2016:
Quote: This is stale optimization advice left over from Pentium 4 ... Some modern compilers (including clang-3.8, and Intel's ICC 13) do use inc when optimizing for speed (-O3), not just for size

New version attached. I've even included the example by arafel posted here a bit more than ten years ago, but no luck with my strange Core i5, it just refuses to honour these theories :(

FORTRANS

Hi,

   Here is a quick result, jj2007.

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

15825 cycles for 1000 * dec
24592 cycles for 1000 * sub

15828 cycles for 1000 * dec
24590 cycles for 1000 * sub

15839 cycles for 1000 * dec
24587 cycles for 1000 * sub

15821 cycles for 1000 * dec
24583 cycles for 1000 * sub

15832 cycles for 1000 * dec
24581 cycles for 1000 * sub

56 bytes for dec
74 bytes for sub


--- ok ---


   Won't run on older processors.

HTH,

Steve N.

nidud

deleted

nidud

deleted

hutch--

I cannot get a meaningful difference with a test piece like the following.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
    .686p
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    .code
start:

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL tc    :DWORD

  ; ----
  ; load
  ; ----
    mov edx, 4000000000
  @@:
    sub edx, 1
    jnz @B

    cpuid                       ; serialize execution (this does not flush any cache)
    invoke SleepEx,90,0         ; wait for some timeslices

  ; ----
  ; time
  ; ----
    invoke GetTickCount
    push eax

    mov edx, 4000000000

  @@:
  REPEAT 3                      ; unroll by 4
    sub edx, 1
    ; dec edx
    jz @F
  ENDM
    sub edx, 1
    ; dec edx
    jnz @B
  @@:

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," timing ms",13,10

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

ragdog


AMD Athlon(tm) II P360 Dual-Core Processor (SSE3)

6571    cycles for 1000 * dec
5825    cycles for 1000 * sub

5862    cycles for 1000 * dec
5824    cycles for 1000 * sub

5801    cycles for 1000 * dec
5770    cycles for 1000 * sub

5834    cycles for 1000 * dec
5874    cycles for 1000 * sub

5939    cycles for 1000 * dec
5810    cycles for 1000 * sub

9       bytes for dec
29      bytes for sub


--- ok ---

daydreamer

Why not discuss testing new instructions versus old ones?
For example, the SSE float and SSE2 packed-integer instructions,
which turn out to be useful when coding parallel operations.