### Author Topic: Optimizing Collatz  (Read 2631 times)

#### nidud

• Member
• Posts: 1612
##### Re: Optimizing Collatz
« Reply #15 on: November 17, 2017, 01:01:58 AM »
Here's a calibrated testbed for timing.

The code used: rcx=arg1, rdx=arg2
Code: [Select]
`.x64.model flat.codeL2:    mov     r8,rcx    lea     rax,[rcx+rcx*2+1]    shr     r8,1    and     ecx,1    cmove   rax,r8    add     rdx,1    cmp     rax,1    mov     rcx,rax    jne     L2    ret    end`
The same code is loaded for each iteration so the result should be more or less equal:
Code: [Select]
`Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2)------------------------------------------------ (1000000)   164428 cycles, rep(300), code( 32) 1.asm: algorithm 1   162016 cycles, rep(300), code( 32) 2.asm: algorithm 2   162394 cycles, rep(300), code( 32) 3.asm: algorithm 3   164018 cycles, rep(300), code( 32) 4.asm: algorithm 4-- (1000001)   123084 cycles, rep(300), code( 32) 1.asm: algorithm 1   123519 cycles, rep(300), code( 32) 2.asm: algorithm 2   124434 cycles, rep(300), code( 32) 3.asm: algorithm 3   123199 cycles, rep(300), code( 32) 4.asm: algorithm 4total [1000000 .. 1000001], 1++   285535 cycles 2.asm: algorithm 2   286828 cycles 3.asm: algorithm 3   287217 cycles 4.asm: algorithm 4   287512 cycles 1.asm: algorithm 1hit any key to continue...`

#### hutch--

• Member
• Posts: 5767
• Mnemonic Driven API Grinder
##### Re: Optimizing Collatz
« Reply #16 on: November 17, 2017, 01:16:24 AM »
See if you can find an AMD optimisation guide as AMD hardware usually does things differently to Intel hardware and the general optimisations used with an Intel processor may not be optimal for an AMD processor.
hutch at movsd dot com
http://www.masm32.com

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #17 on: November 17, 2017, 01:20:51 AM »
Thank you very much for the test suit nidud!

This are my results:
Code: [Select]
`-- (1000000)   159518 cycles, rep(300), code( 32) 1.asm: algorithm 1   153129 cycles, rep(300), code( 32) 2.asm: algorithm 2   157345 cycles, rep(300), code( 32) 3.asm: algorithm 3   153764 cycles, rep(300), code( 32) 4.asm: algorithm 4-- (1000001)   116474 cycles, rep(300), code( 32) 1.asm: algorithm 1   116084 cycles, rep(300), code( 32) 2.asm: algorithm 2   115590 cycles, rep(300), code( 32) 3.asm: algorithm 3   115997 cycles, rep(300), code( 32) 4.asm: algorithm 4total [1000000 .. 1000001], 1++   269213 cycles 2.asm: algorithm 2   269761 cycles 4.asm: algorithm 4   272935 cycles 3.asm: algorithm 3   275992 cycles 1.asm: algorithm 1hit any key to continue...`
Does it means that I can change the code of the asm files and run the same cmd program to test it?

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #18 on: November 17, 2017, 01:25:51 AM »
See if you can find an AMD optimisation guide as AMD hardware usually does things differently to Intel hardware and the general optimisations used with an Intel processor may not be optimal for an AMD processor.

Well, we can use the optimization guide of Agner that include both AMD and Intel. However, I decided to start optimizing it for Skylake to make the task simpler.

#### nidud

• Member
• Posts: 1612
##### Re: Optimizing Collatz
« Reply #19 on: November 17, 2017, 01:37:07 AM »
Does it means that I can change the code of the asm files and run the same cmd program to test it?

Yes.
If the results are more or less equal, as they are, you may modify one of them and retest.

In the main file:
procs   equ <for x,<1,2,3,4>> ; add functions to test...
You may change this to:
procs   equ <for x,<0,1>> ; add functions to test...

Now you can apply changes to 1.asm and compare it to 0.asm

You may also look into to the validate_x() function.
The return code is now in RDX so this could be tested with a fixed input:
Code: [Select]
`        mov ecx,1000000        mov edx,1        mov ebx,153        xor rax,rax        call rsiif 1        .if edx != ebx            printf( "error: rdx = %i (%i) %u.asm\n", edx, ebx, x )            inc nerror        .endifendif`

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #20 on: November 17, 2017, 01:53:04 AM »
Thank you very much for the clarification.

I'm trying to do a variation with branches as some users proposed:
Code: [Select]
`.x64.model flat.codeL2:    and    ecx,1    jz    @EVEN    lea     rcx,[rcx+rcx*2+1]    jmp   @ENDIF@EVEN:    shr     rcx,1@ENDIF:    add     rdx,1    cmp     rcx,1    jne     L2    ret    end`
However this code doesn't work, so I'm trying to figure out the reason.

#### nidud

• Member
• Posts: 1612
##### Re: Optimizing Collatz
« Reply #21 on: November 17, 2017, 02:22:32 AM »
and ecx,1 sets ecx to zero = eternal loop

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #22 on: November 17, 2017, 02:28:41 AM »
Good point, I just made it run with "test" but I needed to move the register to eax.

Code: [Select]
`.x64.model flat.codemov rax,rcxL2:    test    rax,1    jz   @EVEN    lea     rax,[rax+rax*2+1]    jmp  @ENDIF@EVEN:    shr     rax,1@ENDIF:    inc     rdx    cmp     rax,1    jne     L2    ret    end`

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #23 on: November 17, 2017, 02:39:57 AM »
Finally I've considered the 3 alternatives:

1º(branch modification)
Code: [Select]
` mov rax,rcxL2:    test    rax,1    jz     @EVEN    lea     rax,[rax+rax*2+1]    jmp     @ENDIF@EVEN:    shr     rax,1@ENDIF:    inc     rdx    cmp     rax,1    jne     L2    ret    end`

2º (slightly modification)
Code: [Select]
`L2:    mov     r8,rcx    lea     rax,[rcx+rcx*2+1]    shr     r8,1    test    rcx,1    cmove   rax,r8    inc     rdx    cmp     rax,1    mov     rcx,rax    jne     L2    ret    end`
3º (the one that nidud attached)
Code: [Select]
`L2:    mov     r8,rcx    lea     rax,[rcx+rcx*2+1]    shr     r8,1    and     ecx,1    cmove   rax,r8    add     rdx,1    cmp     rax,1    mov     rcx,rax    jne     L2    ret    end`
If I run the tests my conclusion is that the results are the same. Although in this example you can see small differences it change along test but in conclusion they perform the same.

Code: [Select]
`Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (AVX2)------------------------------------------------ (1000000)   194733 cycles, rep(300), code( 31) 1.asm: algorithm 1   182347 cycles, rep(300), code( 35) 2.asm: algorithm 2   179397 cycles, rep(300), code( 32) 3.asm: algorithm 3-- (1000001)   143990 cycles, rep(300), code( 31) 1.asm: algorithm 1   134991 cycles, rep(300), code( 35) 2.asm: algorithm 2   180371 cycles, rep(300), code( 32) 3.asm: algorithm 3total [1000000 .. 1000001], 1++   317338 cycles 2.asm: algorithm 2   338723 cycles 1.asm: algorithm 1   359768 cycles 3.asm: algorithm 3hit any key to continue...`
Therefore I still wonder if the code can be improved.

#### AW

• Member
• Posts: 1490
• Let's Make ASM Great Again!
##### Re: Optimizing Collatz
« Reply #24 on: November 17, 2017, 03:01:59 AM »

http://masm32.com/board/index.php?topic=6666.msg71484#msg71484

#### nidud

• Member
• Posts: 1612
##### Re: Optimizing Collatz
« Reply #25 on: November 17, 2017, 03:08:22 AM »
Jumps are usually very slow so if you could do without, even by adding 5-6 instructions, it's usually faster.

Therefore I still wonder if the code can be improved.

always

Code: [Select]
`    add     rdx,1    lea     rax,[rcx+rcx*2+1]    shr     rcx,1    cmovc   rcx,rax    cmp     rcx,1    jne     L2`

#### nidud

• Member
• Posts: 1612
##### Re: Optimizing Collatz
« Reply #26 on: November 17, 2017, 03:26:49 AM »
There's a list here where you see the individual instruction timing for AMD and Intel that may give you some idea which ones to use/avoid.

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #27 on: November 17, 2017, 03:32:08 AM »

http://masm32.com/board/index.php?topic=6666.msg71484#msg71484

It's true, to measure time inside the code I will need to use chronos with nanoseconds in C++. I will have to generate it with the compiler and then add the changes that we have talked about in GAS.

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #28 on: November 17, 2017, 03:36:25 AM »
Jumps are usually very slow so if you could do without, even by adding 5-6 instructions, it's usually faster.

Therefore I still wonder if the code can be improved.

always

Code: [Select]
`    add     rdx,1    lea     rax,[rcx+rcx*2+1]    shr     rcx,1    cmovc   rcx,rax    cmp     rcx,1    jne     L2`

This modification of the code looks better because you avoid some moves but unfortuntely I get this error:
Code: [Select]
`error: rax = 7 (1) 2.asmhit any key to continue...`
By the way, asmc is an emulator or what is exactly?

And the most important.
It's true what you said about branches but it depends on the instructions. Because I can make something like:
n = (3n+1)*least significant bit + (n/2)*(1-least significant bit)
And it doesn't require branches. However, I would say that I tried it and was slower, but I cannot tell it for sure.

Thanks.

PS: Very interesting you benchmark about instructions. According to it bt should be faster than and for example.

#### nidud

• Member
• Posts: 1612
##### Re: Optimizing Collatz
« Reply #29 on: November 17, 2017, 03:49:22 AM »
By the way, asmc is an emulator or what is exactly?

It's a Masm compatible assembler.

Quote
This modification of the code looks better because you avoid some moves but unfortuntely I get this error:

Yes, rax are trashed but the result (rdx) should be correct.