Author Topic: INC DEC versus ADD SUB 2  (Read 1505 times)

nidud

  • Member
  • *****
  • Posts: 1408
    • https://github.com/nidud/asmc
INC DEC versus ADD SUB 2
« on: December 30, 2016, 09:40:18 AM »
The correct encoding should be something like this:
Code: [Select]
.while !eax && !edx && ecx
And this also produces the correct result:
Code: [Select]
0000001C                        .while !eax && !edx && ecx
0000001C  EB01              *   jmp @C0009
0000001E                    *   @C0007:
0000001E  90                    nop
0000001F                        .endw
0000001F                    *   @C0009:
0000001F  85C0              *   test eax , eax
00000021  7508              *   jnz @C0008
00000023  85D2              *   test edx , edx
00000025  7504              *   jnz @C0008
00000027  85C9              *   test ecx, ecx
00000029  75F3              *   jnz @C0007
0000002B                    *   @C0008:

Before the changes were made, this was also the case with .IF:
Code: [Select]
00000000                        .if !(eax || edx) && ecx
00000000  85C0              *   test eax , eax
00000002  7509              *   jnz @C0001
00000004  85D2              *   test edx, edx
00000006  7505              *   jnz @C0001
00000008  85C9              *   test ecx, ecx
0000000A  7401              *   jz  @C0001
0000000C  90                    nop
0000000D                        .endif
0000000D                    *   @C0001:
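
Both listings encode the same condition: by De Morgan, !eax && !edx && ecx equals !(eax || edx) && ecx, and the generated test/jnz chain leaves as soon as eax or edx is non-zero. A minimal sketch in Python to check that reading; the function names are made up for illustration and are not part of asmc:

```python
from itertools import product

def cond_while(eax, edx, ecx):
    """High-level form from the .while listing: !eax && !edx && ecx."""
    return (not eax) and (not edx) and bool(ecx)

def cond_if(eax, edx, ecx):
    """High-level form from the .if listing: !(eax || edx) && ecx."""
    return (not (eax or edx)) and bool(ecx)

def lowered(eax, edx, ecx):
    """Mirror of the generated test/jnz chain; True means 'run the body'."""
    if eax != 0:       # test eax, eax / jnz exit
        return False
    if edx != 0:       # test edx, edx / jnz exit
        return False
    return ecx != 0    # test ecx, ecx / jnz body (jz exit in the .if case)

def all_agree(values=(0, 1, 0xFFFFFFFF)):
    """Compare the three forms over every combination of the sample values."""
    return all(
        cond_while(a, d, c) == cond_if(a, d, c) == lowered(a, d, c)
        for a, d, c in product(values, repeat=3)
    )
```

The sample values only cover zero, one and all-bits-set, which is enough here because each test looks at nothing but zero versus non-zero.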

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4922
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: INC DEC versus ADD SUB 2
« Reply #1 on: December 30, 2016, 11:45:35 AM »
JJ,

Funny results on my Haswell.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

??      cycles for 1000 * dec
??      cycles for 1000 * sub

??      cycles for 1000 * dec
??      cycles for 1000 * sub

??      cycles for 1000 * dec
??      cycles for 1000 * sub

??      cycles for 1000 * dec
??      cycles for 1000 * sub

??      cycles for 1000 * dec
??      cycles for 1000 * sub

0       bytes for dec
2       bytes for sub


--- ok ---
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

jj2007

  • Member
  • *****
  • Posts: 7728
  • Assembler is fun ;-)
    • MasmBasic
Re: INC DEC versus ADD SUB 2
« Reply #2 on: December 30, 2016, 07:42:29 PM »
Quote
Funny results on my Haswell.

Try this one.
Code: [Select]
NameA equ dec ; assign a descriptive name here
TestA proc
  mov ebx, AlgoLoops-1 ; loop e.g. 100x
  align 4
  .Repeat
    dec eax
    dec ecx
    dec edx
    dec eax
    dec ecx
    dec edx
    dec eax
    dec ecx
    dec edx
    dec ebx
  .Until Sign?
  ret
TestA endp
TestA_endp:

align_64
TestB_s:
NameB equ sub ; assign a descriptive name here
TestB proc
  mov ebx, AlgoLoops-1 ; loop e.g. 100x
  align 4
  .Repeat
    sub eax, 1
    sub ecx, 1
    sub edx, 1
    sub eax, 1
    sub ecx, 1
    sub edx, 1
    sub eax, 1
    sub ecx, 1
    sub edx, 1
    sub ebx, 1
  .Until Sign?
  ret
TestB endp
TestB_endp:

TWell

  • Member
  • ****
  • Posts: 748
Re: INC DEC versus ADD SUB 2
« Reply #3 on: December 30, 2016, 08:39:14 PM »
Code: [Select]
AMD Athlon(tm) II X2 220 Processor (SSE3)

2030    cycles for 1000 * dec
2028    cycles for 1000 * sub

2129    cycles for 1000 * dec
2034    cycles for 1000 * sub

2031    cycles for 1000 * dec
2027    cycles for 1000 * sub

2038    cycles for 1000 * dec
2026    cycles for 1000 * sub

2030    cycles for 1000 * dec
2028    cycles for 1000 * sub

9       bytes for dec
29      bytes for sub

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4922
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: INC DEC versus ADD SUB 2
« Reply #4 on: December 30, 2016, 08:53:46 PM »
The later version at least produces numbers.  :biggrin:

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

2368    cycles for 1000 * dec
2351    cycles for 1000 * sub

2359    cycles for 1000 * dec
2346    cycles for 1000 * sub

2355    cycles for 1000 * dec
2348    cycles for 1000 * sub

2357    cycles for 1000 * dec
2354    cycles for 1000 * sub

2361    cycles for 1000 * dec
2349    cycles for 1000 * sub

9       bytes for dec
29      bytes for sub


--- ok ---


There is an unusual effect with this Haswell: when it's idling, the processor drops the frequency from 3.3 GHz to about 1.2 GHz, so when I time anything I run a load first to get it to come up to speed, then time the algo.
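
The warm-up-then-time idea can be sketched outside assembler as well. This is only an illustration in Python: the helper names, the 0.2 second warm-up and the best-of-five choice are assumptions, and whether a warm-up of that length is enough depends on the CPU and the OS power settings.

```python
import time

def busy_loop(n):
    """Simple integer countdown, used as warm-up load and as the timed work."""
    while n > 0:
        n -= 1
    return n

def timed_after_warmup(work, warmup_seconds=0.2, repeats=5):
    """Spin for warmup_seconds, then return the best of `repeats` timings of work()."""
    deadline = time.perf_counter() + warmup_seconds
    while time.perf_counter() < deadline:
        busy_loop(10_000)              # keep the core busy so it clocks up
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        work()
        best = min(best, time.perf_counter() - start)
    return best                        # seconds for the fastest run
```

Something like timed_after_warmup(lambda: busy_loop(1_000_000)) would then give the time of the fastest run in seconds.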

Now back to ADD SUB versus INC DEC. Intel say that ADD SUB is faster because it writes all of the flags, whereas INC DEC leave the carry flag untouched and so depend on the earlier flag state, so while the simplified timing produces similar results, the real test is in an optimised algo where there is an interaction between instruction choices. It has been this way since the end of the .386 with pipelines, out of order execution and so on. Strangely enough Intel usually do know what they are talking about.
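
For reference, the architectural difference between the two forms is in the carry flag: DEC and INC leave CF as it was, while SUB and ADD rewrite it along with the other arithmetic flags. A simplified model in Python, assuming only CF, ZF and SF matter for the illustration (OF, AF and PF are left out, and the function names are hypothetical):

```python
MASK32 = 0xFFFFFFFF

def sub32(dest, src, flags):
    """SUB dest, src on unsigned 32-bit values; the old flags are ignored on purpose."""
    result = (dest - src) & MASK32
    new_flags = {
        "CF": int(dest < src),     # borrow out of bit 31
        "ZF": int(result == 0),
        "SF": result >> 31,
    }
    return result, new_flags

def dec32(dest, flags):
    """DEC dest: same ZF/SF as SUB dest, 1, but CF is carried over unchanged."""
    result = (dest - 1) & MASK32
    new_flags = {
        "CF": flags["CF"],         # the one value taken from the old flags
        "ZF": int(result == 0),
        "SF": result >> 31,
    }
    return result, new_flags
```

In the model dec32 is the only one that reads the old flags, which is the dependency the Intel advice is about. Whether that costs any cycles on a given CPU is what the timings in this thread are trying to find out.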

None of this matters with AMD hardware which is internally different to Intel processors.

jj2007

  • Member
  • *****
  • Posts: 7728
  • Assembler is fun ;-)
    • MasmBasic
Re: INC DEC versus ADD SUB 2
« Reply #5 on: December 30, 2016, 10:03:51 PM »
Quote
The later version at least produces numbers.  :biggrin:
...
Intel say that ADD SUB is faster because it writes all of the flags, whereas INC DEC leave the carry flag untouched and so depend on the earlier flag state, so while the simplified timing produces similar results, the real test is in an optimised algo where there is an interaction between instruction choices.

I've been waiting some years now for a real world example of this mysterious behaviour. In the meantime, as long as this example doesn't pop up, I will continue to use the instruction cache friendly inc and dec instructions. Btw, please shove this intermezzo into the Lab - it is unfair to Habran & Johnsa to hijack their thread. My fault 8)

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2963    cycles for 1000 * dec
2954    cycles for 1000 * sub

2962    cycles for 1000 * dec
2956    cycles for 1000 * sub

One more - a desperate attempt to insert all kinds of dependencies:
Code: [Select]
  .Repeat
    dec eax
    mov eax, dword ptr somestring
    dec ecx
    mov eax, dword ptr somestring
    dec edx
    je @F
    dec eax
    push eax
    mov dword ptr somestring, eax
    dec ecx
    mov eax, dword ptr somestring
    pop edx
    sar edx, 1
    dec edx
@@: dec eax
    mov dword ptr somestring, eax
    dec ecx
    mov eax, dword ptr somestring
    dec edx
    dec ebx
  .Until Sign?

The sub reg, 1 version looks identical; it is just much, much longer. And the results are identical on my CPU:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

12890   cycles for 1000 * dec
12909   cycles for 1000 * sub

12911   cycles for 1000 * dec
12919   cycles for 1000 * sub

12892   cycles for 1000 * dec
12892   cycles for 1000 * sub

12890   cycles for 1000 * dec
12893   cycles for 1000 * sub

12884   cycles for 1000 * dec
12894   cycles for 1000 * sub

45      bytes for dec
65      bytes for sub
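
The 20-byte gap between the two byte counts (45 vs 65 here, 9 vs 29 in the earlier runs) matches ten register decrements at one byte each for dec against three bytes each for sub reg, 1. A rough sketch of the two 32-bit encodings in Python; the helper names are hypothetical, and it only models the decrements, not whatever else the test harness counts:

```python
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3}

def encode_dec(reg):
    """DEC r32, short form 48+rd (32-bit mode only; 48h-4Fh are REX prefixes in 64-bit mode)."""
    return bytes([0x48 + REG32[reg]])

def encode_sub_imm8(reg, imm):
    """SUB r/m32, imm8: opcode 83 /5 ib with a register-direct ModRM byte."""
    modrm = 0xC0 | (5 << 3) | REG32[reg]
    return bytes([0x83, modrm, imm & 0xFF])

# Ten decrements per loop iteration; the register order does not affect the size.
LOOP_REGS = ["eax", "ecx", "edx"] * 3 + ["ebx"]

def size_difference(regs=LOOP_REGS):
    """Extra bytes needed when every dec reg is written as sub reg, 1."""
    dec_bytes = sum(len(encode_dec(r)) for r in regs)
    sub_bytes = sum(len(encode_sub_imm8(r, 1)) for r in regs)
    return sub_bytes - dec_bytes
```

So dec eax comes out as 48 and sub eax, 1 as 83 E8 01, two bytes more per instruction.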

My personal conclusion would be "if you still own a Prescott P4, and you need incredibly fast code for this machine, use sub reg, 1 instead of dec reg", after having carefully studied the explanations of Gilgamesz posted December 7, 2016:
Quote
This is stale optimization advice left over from Pentium 4 ... Some modern compilers (including clang-3.8, and Intel’s ICC 13) do use inc when optimizing for speed (-O3), not just for size

New version attached. I've even included the example by arafel posted here a bit more than ten years ago, but no luck with my strange Core i5, it just refuses to honour these theories :(
« Last Edit: December 30, 2016, 11:45:07 PM by jj2007 »

FORTRANS

  • Member
  • ****
  • Posts: 945
Re: INC DEC versus ADD SUB 2
« Reply #6 on: December 31, 2016, 01:17:46 AM »
Hi,

   Here is a quick result jj2007.

Code: [Select]
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

15825 cycles for 1000 * dec
24592 cycles for 1000 * sub

15828 cycles for 1000 * dec
24590 cycles for 1000 * sub

15839 cycles for 1000 * dec
24587 cycles for 1000 * sub

15821 cycles for 1000 * dec
24583 cycles for 1000 * sub

15832 cycles for 1000 * dec
24581 cycles for 1000 * sub

56 bytes for dec
74 bytes for sub


--- ok ---

   Won't run on older processors.

HTH,

Steve N.
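
To put the posted runs side by side, one option is the ratio of the median sub count to the median dec count on each machine. A small sketch in Python with the numbers copied from this thread; the ratio is taken per machine only, since the posts were not all made with the same version of the test:

```python
from statistics import median

# Cycle counts copied from the runs posted in this thread.
RUNS = {
    "Pentium M 1.70GHz": {
        "dec": [15825, 15828, 15839, 15821, 15832],
        "sub": [24592, 24590, 24587, 24583, 24581],
    },
    "i5-2450M": {
        "dec": [12890, 12911, 12892, 12890, 12884],
        "sub": [12909, 12919, 12892, 12893, 12894],
    },
}

def sub_over_dec(cpu):
    """Median sub cycles divided by median dec cycles (1.0 means no difference)."""
    runs = RUNS[cpu]
    return median(runs["sub"]) / median(runs["dec"])
```

On these figures the Pentium M needs roughly half as many cycles again for the sub version, while the i5-2450M shows practically no difference.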

nidud

  • Member
  • *****
  • Posts: 1408
    • https://github.com/nidud/asmc
Re: INC DEC versus ADD SUB 2
« Reply #7 on: December 31, 2016, 01:37:36 AM »
Some old CPU test results:

http://masm32.com/board/index.php?topic=3396.msg35930#msg35930

Code: [Select]
Instruction Clock Cycle Calculation

1. AMD Athlon(tm) II X2 245 Processor
2. Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz

Instr.  Operands   Size   CPU1   CPU2
-------------------------------------
DEC     reg        [1]     2.8    2.5
DEC     r16        [2]     2.7    2.6
DEC     r08        [2]     2.7    2.5
DEC     mem        [6]    19.1   16.5

SUB     reg,reg    [2]     1.0    1.2
SUB     reg,mem    [6]     2.8    2.6
SUB     reg,imm    [3]     2.8    2.6
SUB     acc,imm    [3]     2.8    2.8
SUB     mem,reg    [6]    19.1   16.1

nidud

  • Member
  • *****
  • Posts: 1408
    • https://github.com/nidud/asmc
Re: INC DEC versus ADD SUB 2
« Reply #8 on: December 31, 2016, 01:49:33 AM »
Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)

11467   cycles for 1000 * dec
11073   cycles for 1000 * sub

11467   cycles for 1000 * dec
11067   cycles for 1000 * sub

11486   cycles for 1000 * dec
11070   cycles for 1000 * sub

11467   cycles for 1000 * dec
11066   cycles for 1000 * sub

11474   cycles for 1000 * dec
11137   cycles for 1000 * sub

56      bytes for dec
74      bytes for sub

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4922
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: INC DEC versus ADD SUB 2
« Reply #9 on: December 31, 2016, 01:59:50 AM »
I cannot get a meaningful difference with a test piece like the following.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
    .686p
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    .code
start:

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL tc    :DWORD

  ; ----
  ; load
  ; ----
    mov edx, 4000000000
  @@:
    sub edx, 1
    jnz @B

    cpuid                       ; serializing instruction (clobbers eax, ebx, ecx, edx)
    invoke SleepEx,90,0         ; wait for some timeslices

  ; ----
  ; time
  ; ----
    invoke GetTickCount
    push eax

    mov edx, 4000000000

  @@:
  REPEAT 3                      ; unroll by 4
    sub edx, 1
    ; dec edx
    jz @F
  ENDM
    sub edx, 1
    ; dec edx
    jnz @B
  @@:

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," timing ms",13,10

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

ragdog

  • Member
  • ****
  • Posts: 523
Re: INC DEC versus ADD SUB 2
« Reply #10 on: December 31, 2016, 04:29:17 AM »
Code: [Select]
AMD Athlon(tm) II P360 Dual-Core Processor (SSE3)

6571    cycles for 1000 * dec
5825    cycles for 1000 * sub

5862    cycles for 1000 * dec
5824    cycles for 1000 * sub

5801    cycles for 1000 * dec
5770    cycles for 1000 * sub

5834    cycles for 1000 * dec
5874    cycles for 1000 * sub

5939    cycles for 1000 * dec
5810    cycles for 1000 * sub

9       bytes for dec
29      bytes for sub


--- ok ---

daydreamer2

  • Member
  • **
  • Posts: 165
Re: INC DEC versus ADD SUB 2
« Reply #11 on: February 16, 2017, 12:25:09 AM »
Why not discuss testing new instructions versus old ones?
For example, SSE packed-float and SSE2 packed-integer instructions,
which turn out to be useful when coding parallel operations.