Optimizing Collatz

nidud · November 17, 2017, 01:01:58 AM

deleted

hutch-- · November 17, 2017, 01:16:24 AM

See if you can find an AMD optimisation guide as AMD hardware usually does things differently to Intel hardware and the general optimisations used with an Intel processor may not be optimal for an AMD processor.

fylux · November 17, 2017, 01:20:51 AM

Thank you very much for the test suit nidud!

This are my results:

Code Select


-- (1000000)
   159518 cycles, rep(300), code( 32) 1.asm: algorithm 1
   153129 cycles, rep(300), code( 32) 2.asm: algorithm 2
   157345 cycles, rep(300), code( 32) 3.asm: algorithm 3
   153764 cycles, rep(300), code( 32) 4.asm: algorithm 4
-- (1000001)
   116474 cycles, rep(300), code( 32) 1.asm: algorithm 1
   116084 cycles, rep(300), code( 32) 2.asm: algorithm 2
   115590 cycles, rep(300), code( 32) 3.asm: algorithm 3
   115997 cycles, rep(300), code( 32) 4.asm: algorithm 4

total [1000000 .. 1000001], 1++
   269213 cycles 2.asm: algorithm 2
   269761 cycles 4.asm: algorithm 4
   272935 cycles 3.asm: algorithm 3
   275992 cycles 1.asm: algorithm 1
hit any key to continue...

Does it means that I can change the code of the asm files and run the same cmd program to test it?

fylux · November 17, 2017, 01:25:51 AM

Quote from: hutch-- on November 17, 2017, 01:16:24 AM
See if you can find an AMD optimisation guide as AMD hardware usually does things differently to Intel hardware and the general optimisations used with an Intel processor may not be optimal for an AMD processor.

Well, we can use the optimization guide of Agner that include both AMD and Intel. However, I decided to start optimizing it for Skylake to make the task simpler.

nidud · November 17, 2017, 01:37:07 AM

deleted

fylux · November 17, 2017, 01:53:04 AM

Thank you very much for the clarification.

I'm trying to do a variation with branches as some users proposed:

Code Select


.x64
.model flat
.code
L2:
    and    ecx,1
    jz	   @EVEN
    lea     rcx,[rcx+rcx*2+1]
    jmp   @ENDIF
@EVEN:
    shr     rcx,1
@ENDIF:
    add     rdx,1
    cmp     rcx,1
    jne     L2
    ret

    end

However this code doesn't work, so I'm trying to figure out the reason.

nidud · November 17, 2017, 02:22:32 AM

deleted

fylux · November 17, 2017, 02:28:41 AM

Good point, I just made it run with "test" but I needed to move the register to eax.

Code Select


.x64
.model flat
.code

mov		rax,rcx
L2:
    test    rax,1
    jz	  @EVEN
    lea     rax,[rax+rax*2+1]
    jmp  @ENDIF
@EVEN:
    shr     rax,1
@ENDIF:
    inc     rdx
    cmp     rax,1
    jne     L2
    ret

    end

fylux · November 17, 2017, 02:39:57 AM

Finally I've considered the 3 alternatives:

1º(branch modification)

Code Select


 mov		rax,rcx
L2:
    test    rax,1
    jz	    @EVEN
    lea     rax,[rax+rax*2+1]
    jmp     @ENDIF
@EVEN:
    shr     rax,1
@ENDIF:
    inc     rdx
    cmp     rax,1
    jne     L2
    ret

    end

2º (slightly modification)

Code Select


L2:
    mov     r8,rcx
    lea     rax,[rcx+rcx*2+1]
    shr     r8,1
    test    rcx,1
    cmove   rax,r8
    inc     rdx
    cmp     rax,1
    mov     rcx,rax
    jne     L2
    ret

    end

3º (the one that nidud attached)

Code Select


L2:
    mov     r8,rcx
    lea     rax,[rcx+rcx*2+1]
    shr     r8,1
    and     ecx,1
    cmove   rax,r8
    add     rdx,1
    cmp     rax,1
    mov     rcx,rax
    jne     L2
    ret

    end

If I run the tests my conclusion is that the results are the same. Although in this example you can see small differences it change along test but in conclusion they perform the same.

Code Select

Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (AVX2)
----------------------------------------------
-- (1000000)
   194733 cycles, rep(300), code( 31) 1.asm: algorithm 1
   182347 cycles, rep(300), code( 35) 2.asm: algorithm 2
   179397 cycles, rep(300), code( 32) 3.asm: algorithm 3
-- (1000001)
   143990 cycles, rep(300), code( 31) 1.asm: algorithm 1
   134991 cycles, rep(300), code( 35) 2.asm: algorithm 2
   180371 cycles, rep(300), code( 32) 3.asm: algorithm 3

total [1000000 .. 1000001], 1++
   317338 cycles 2.asm: algorithm 2
   338723 cycles 1.asm: algorithm 1
   359768 cycles 3.asm: algorithm 3
hit any key to continue...

Therefore I still wonder if the code can be improved.

aw27 · November 17, 2017, 03:01:59 AM

Don't jump to immediate conclusions, I am with Hutch on this one (cycles are misleading)
http://masm32.com/board/index.php?topic=6666.msg71484#msg71484

nidud · November 17, 2017, 03:08:22 AM

deleted

nidud · November 17, 2017, 03:26:49 AM

deleted

fylux · November 17, 2017, 03:32:08 AM

Quote from: aw27 on November 17, 2017, 03:01:59 AM

Don't jump to immediate conclusions, I am with Hutch on this one (cycles are misleading)
http://masm32.com/board/index.php?topic=6666.msg71484#msg71484

It's true, to measure time inside the code I will need to use chronos with nanoseconds in C++. I will have to generate it with the compiler and then add the changes that we have talked about in GAS.

fylux · November 17, 2017, 03:36:25 AM

Quote from: nidud on November 17, 2017, 03:08:22 AM
Jumps are usually very slow so if you could do without, even by adding 5-6 instructions, it's usually faster.

Quote from: fylux on November 17, 2017, 02:39:57 AM
Therefore I still wonder if the code can be improved.

always

Code Select Expand
add rdx,1 lea rax,[rcx+rcx*2+1] shr rcx,1 cmovc rcx,rax cmp rcx,1 jne L2

This modification of the code looks better because you avoid some moves but unfortuntely I get this error:

Code Select

error: rax = 7 (1) 2.asm
hit any key to continue...

By the way, asmc is an emulator or what is exactly?

And the most important.
It's true what you said about branches but it depends on the instructions. Because I can make something like:
n = (3n+1)*least significant bit + (n/2)*(1-least significant bit)
And it doesn't require branches. However, I would say that I tried it and was slower, but I cannot tell it for sure.

Thanks.

PS: Very interesting you benchmark about instructions. According to it bt should be faster than and for example.

nidud · November 17, 2017, 03:49:22 AM

deleted

The MASM Forum

News:

Optimizing Collatz

nidud

hutch--

fylux

fylux

nidud

fylux

nidud

fylux

fylux

aw27

nidud

nidud

fylux

fylux

nidud