Author Topic: Optimizing Collatz  (Read 2631 times)

nidud

  • Member
  • *****
  • Posts: 1612
    • https://github.com/nidud/asmc
Re: Optimizing Collatz
« Reply #15 on: November 17, 2017, 01:01:58 AM »
Here's a calibrated testbed for timing.

The code used: rcx=arg1, rdx=arg2
Code: [Select]
.x64
.model flat
.code
L2:
    mov     r8,rcx
    lea     rax,[rcx+rcx*2+1]
    shr     r8,1
    and     ecx,1
    cmove   rax,r8
    add     rdx,1
    cmp     rax,1
    mov     rcx,rax
    jne     L2
    ret
    end

The same code is loaded for each iteration so the result should be more or less equal:
Code: [Select]
Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2)
----------------------------------------------
-- (1000000)
   164428 cycles, rep(300), code( 32) 1.asm: algorithm 1
   162016 cycles, rep(300), code( 32) 2.asm: algorithm 2
   162394 cycles, rep(300), code( 32) 3.asm: algorithm 3
   164018 cycles, rep(300), code( 32) 4.asm: algorithm 4
-- (1000001)
   123084 cycles, rep(300), code( 32) 1.asm: algorithm 1
   123519 cycles, rep(300), code( 32) 2.asm: algorithm 2
   124434 cycles, rep(300), code( 32) 3.asm: algorithm 3
   123199 cycles, rep(300), code( 32) 4.asm: algorithm 4

total [1000000 .. 1000001], 1++
   285535 cycles 2.asm: algorithm 2
   286828 cycles 3.asm: algorithm 3
   287217 cycles 4.asm: algorithm 4
   287512 cycles 1.asm: algorithm 1
hit any key to continue...

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 5767
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Optimizing Collatz
« Reply #16 on: November 17, 2017, 01:16:24 AM »
See if you can find an AMD optimisation guide as AMD hardware usually does things differently to Intel hardware and the general optimisations used with an Intel processor may not be optimal for an AMD processor.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

fylux

  • Regular Member
  • *
  • Posts: 16
Re: Optimizing Collatz
« Reply #17 on: November 17, 2017, 01:20:51 AM »
Thank you very much for the test suit nidud!

This are my results:
Code: [Select]
-- (1000000)
   159518 cycles, rep(300), code( 32) 1.asm: algorithm 1
   153129 cycles, rep(300), code( 32) 2.asm: algorithm 2
   157345 cycles, rep(300), code( 32) 3.asm: algorithm 3
   153764 cycles, rep(300), code( 32) 4.asm: algorithm 4
-- (1000001)
   116474 cycles, rep(300), code( 32) 1.asm: algorithm 1
   116084 cycles, rep(300), code( 32) 2.asm: algorithm 2
   115590 cycles, rep(300), code( 32) 3.asm: algorithm 3
   115997 cycles, rep(300), code( 32) 4.asm: algorithm 4

total [1000000 .. 1000001], 1++
   269213 cycles 2.asm: algorithm 2
   269761 cycles 4.asm: algorithm 4
   272935 cycles 3.asm: algorithm 3
   275992 cycles 1.asm: algorithm 1
hit any key to continue...

Does it means that I can change the code of the asm files and run the same cmd program to test it?

fylux

  • Regular Member
  • *
  • Posts: 16
Re: Optimizing Collatz
« Reply #18 on: November 17, 2017, 01:25:51 AM »
See if you can find an AMD optimisation guide as AMD hardware usually does things differently to Intel hardware and the general optimisations used with an Intel processor may not be optimal for an AMD processor.

Well, we can use the optimization guide of Agner that include both AMD and Intel. However, I decided to start optimizing it for Skylake to make the task simpler.

nidud

  • Member
  • *****
  • Posts: 1612
    • https://github.com/nidud/asmc
Re: Optimizing Collatz
« Reply #19 on: November 17, 2017, 01:37:07 AM »
Does it means that I can change the code of the asm files and run the same cmd program to test it?

Yes.
If the results are more or less equal, as they are, you may modify one of them and retest.

In the main file:
procs   equ <for x,<1,2,3,4>> ; add functions to test...
You may change this to:
procs   equ <for x,<0,1>> ; add functions to test...

Now you can apply changes to 1.asm and compare it to 0.asm

You may also look into to the validate_x() function.
The return code is now in RDX so this could be tested with a fixed input:
Code: [Select]
        mov ecx,1000000
        mov edx,1
        mov ebx,153
        xor rax,rax
        call rsi
if 1
        .if edx != ebx
            printf( "error: rdx = %i (%i) %u.asm\n", edx, ebx, x )
            inc nerror
        .endif
endif

fylux

  • Regular Member
  • *
  • Posts: 16
Re: Optimizing Collatz
« Reply #20 on: November 17, 2017, 01:53:04 AM »
Thank you very much for the clarification.

I'm trying to do a variation with branches as some users proposed:
Code: [Select]
.x64
.model flat
.code
L2:
    and    ecx,1
    jz    @EVEN
    lea     rcx,[rcx+rcx*2+1]
    jmp   @ENDIF
@EVEN:
    shr     rcx,1
@ENDIF:
    add     rdx,1
    cmp     rcx,1
    jne     L2
    ret

    end

However this code doesn't work, so I'm trying to figure out the reason.

nidud

  • Member
  • *****
  • Posts: 1612
    • https://github.com/nidud/asmc
Re: Optimizing Collatz
« Reply #21 on: November 17, 2017, 02:22:32 AM »
and ecx,1 sets ecx to zero = eternal loop

fylux

  • Regular Member
  • *
  • Posts: 16
Re: Optimizing Collatz
« Reply #22 on: November 17, 2017, 02:28:41 AM »
Good point, I just made it run with "test" but I needed to move the register to eax.

Code: [Select]
.x64
.model flat
.code

mov rax,rcx
L2:
    test    rax,1
    jz   @EVEN
    lea     rax,[rax+rax*2+1]
    jmp  @ENDIF
@EVEN:
    shr     rax,1
@ENDIF:
    inc     rdx
    cmp     rax,1
    jne     L2
    ret

    end


fylux

  • Regular Member
  • *
  • Posts: 16
Re: Optimizing Collatz
« Reply #23 on: November 17, 2017, 02:39:57 AM »
Finally I've considered the 3 alternatives:

1º(branch modification)
Code: [Select]
mov rax,rcx
L2:
    test    rax,1
    jz     @EVEN
    lea     rax,[rax+rax*2+1]
    jmp     @ENDIF
@EVEN:
    shr     rax,1
@ENDIF:
    inc     rdx
    cmp     rax,1
    jne     L2
    ret

    end


2º (slightly modification)
Code: [Select]
L2:
    mov     r8,rcx
    lea     rax,[rcx+rcx*2+1]
    shr     r8,1
    test    rcx,1
    cmove   rax,r8
    inc     rdx
    cmp     rax,1
    mov     rcx,rax
    jne     L2
    ret

    end

3º (the one that nidud attached)
Code: [Select]
L2:
    mov     r8,rcx
    lea     rax,[rcx+rcx*2+1]
    shr     r8,1
    and     ecx,1
    cmove   rax,r8
    add     rdx,1
    cmp     rax,1
    mov     rcx,rax
    jne     L2
    ret

    end

If I run the tests my conclusion is that the results are the same. Although in this example you can see small differences it change along test but in conclusion they perform the same.

Code: [Select]
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (AVX2)
----------------------------------------------
-- (1000000)
   194733 cycles, rep(300), code( 31) 1.asm: algorithm 1
   182347 cycles, rep(300), code( 35) 2.asm: algorithm 2
   179397 cycles, rep(300), code( 32) 3.asm: algorithm 3
-- (1000001)
   143990 cycles, rep(300), code( 31) 1.asm: algorithm 1
   134991 cycles, rep(300), code( 35) 2.asm: algorithm 2
   180371 cycles, rep(300), code( 32) 3.asm: algorithm 3

total [1000000 .. 1000001], 1++
   317338 cycles 2.asm: algorithm 2
   338723 cycles 1.asm: algorithm 1
   359768 cycles 3.asm: algorithm 3
hit any key to continue...

Therefore I still wonder if the code can be improved.

AW

  • Member
  • *****
  • Posts: 1490
  • Let's Make ASM Great Again!
Re: Optimizing Collatz
« Reply #24 on: November 17, 2017, 03:01:59 AM »

Don't jump to immediate conclusions, I am with Hutch on this one (cycles are misleading)
http://masm32.com/board/index.php?topic=6666.msg71484#msg71484

nidud

  • Member
  • *****
  • Posts: 1612
    • https://github.com/nidud/asmc
Re: Optimizing Collatz
« Reply #25 on: November 17, 2017, 03:08:22 AM »
Jumps are usually very slow so if you could do without, even by adding 5-6 instructions, it's usually faster.

Therefore I still wonder if the code can be improved.

always  :biggrin:

Code: [Select]
    add     rdx,1
    lea     rax,[rcx+rcx*2+1]
    shr     rcx,1
    cmovc   rcx,rax
    cmp     rcx,1
    jne     L2

nidud

  • Member
  • *****
  • Posts: 1612
    • https://github.com/nidud/asmc
Re: Optimizing Collatz
« Reply #26 on: November 17, 2017, 03:26:49 AM »
There's a list here where you see the individual instruction timing for AMD and Intel that may give you some idea which ones to use/avoid.

fylux

  • Regular Member
  • *
  • Posts: 16
Re: Optimizing Collatz
« Reply #27 on: November 17, 2017, 03:32:08 AM »

Don't jump to immediate conclusions, I am with Hutch on this one (cycles are misleading)
http://masm32.com/board/index.php?topic=6666.msg71484#msg71484

It's true, to measure time inside the code I will need to use chronos with nanoseconds in C++. I will have to generate it with the compiler and then add the changes that we have talked about in GAS.

fylux

  • Regular Member
  • *
  • Posts: 16
Re: Optimizing Collatz
« Reply #28 on: November 17, 2017, 03:36:25 AM »
Jumps are usually very slow so if you could do without, even by adding 5-6 instructions, it's usually faster.

Therefore I still wonder if the code can be improved.

always  :biggrin:

Code: [Select]
    add     rdx,1
    lea     rax,[rcx+rcx*2+1]
    shr     rcx,1
    cmovc   rcx,rax
    cmp     rcx,1
    jne     L2

This modification of the code looks better because you avoid some moves but unfortuntely I get this error:
Code: [Select]
error: rax = 7 (1) 2.asm
hit any key to continue...

By the way, asmc is an emulator or what is exactly?

And the most important.
It's true what you said about branches but it depends on the instructions. Because I can make something like:
n = (3n+1)*least significant bit + (n/2)*(1-least significant bit)
And it doesn't require branches. However, I would say that I tried it and was slower, but I cannot tell it for sure.

Thanks.

PS: Very interesting you benchmark about instructions. According to it bt should be faster than and for example.

nidud

  • Member
  • *****
  • Posts: 1612
    • https://github.com/nidud/asmc
Re: Optimizing Collatz
« Reply #29 on: November 17, 2017, 03:49:22 AM »
By the way, asmc is an emulator or what is exactly?

It's a Masm compatible assembler.

Quote
This modification of the code looks better because you avoid some moves but unfortuntely I get this error:

Yes, rax are trashed but the result (rdx) should be correct.