### Author Topic: Optimizing Collatz  (Read 3217 times)

#### fylux

• Regular Member
• Posts: 16
##### Optimizing Collatz
« on: November 16, 2017, 10:32:08 PM »
Hi everybody!
First of all excuse in advance if this topic is not advanced enough for this section of the forum. Having said that this is the idea.

Collatz conjeture is based on the following idea:
-Take a number
-If it's even divide it by 2
-If it's odd multiply it by 3 and add 1
-Repeat the process until you reach 1

I won't enter into more detail because is not necessary. Now, what I'm trying to do is to find the fastest algorithm to find how many steps it takes to reach 1. The platform where I want to optimize it is Skylake, but I don't think that Skylake will have any advantage compare with other new architectures for this code.

Well, the fastest code that I've found is the following:
Code: [Select]
`.L2:   movq  %rdx, %rsi   leaq  1(%rdx,%rdx,2), %rcx   shrq  %rsi   andl  \$1, %edx   cmove  %rsi, %rcx   addq  \$1, %rax   cmpq  \$1, %rcx   movq  %rcx, %rdx   jne  .L2`
In this code there are a few tricks like use leaq or conditional move (things that I didn't know that exist a few months ago). However I wonder if this part of the code can be further improved. Because it is fully sequential it cannot be parallelized neither vectorized. My only idea is to look for other instructions with lower latency or to find if there is any stall in the code since for example a conditional move is also a branch.

PS: Find attached the whole asm program.

#### Jokaste

• Regular Member
• Posts: 47
• Never be pleased, always improve
##### Re: Optimizing Collatz
« Reply #1 on: November 16, 2017, 10:50:40 PM »
Quote

-Take a number.

-If it's even divide it by 2
-If it's odd multiply it by 3 and add 1

If it is even bit 0 = 0 else bit 0 = 1
replace

Quote
with
Code: [Select]
`inc  %rax`Align label L2
I don't understand this syntax.
Quote
andl  \$1, %edx
Why do you don't use BTx if you always test the bit 1.

Quote
cmpq  \$1, %rcxmovq  %rcx, %rdxjne  .L2
A suggestion

Code: [Select]
`dec rcxmovq  %rcx, %rdxjnz L2`

My proposal
Code: [Select]
`  [00000001400017A7] 41BA0D000000                 mov               r10d,D  [00000001400017AD] B901000000                   mov               ecx,1  [00000001400017B2] EB0C                         jmp               00000001400017C0  [00000001400017B4] 66666690                     nop                 [00000001400017B8] 66666690                     nop                 [00000001400017BC] 66666690                     nop                 [00000001400017C0] 4939CA                       cmp               r10,rcx  [00000001400017C3] 742B                         je                00000001400017F0  [00000001400017C5] 490FBAE200                   bt                r10,0  [00000001400017CA] 7314                         jae               00000001400017E0  [00000001400017CC] EB02                         jmp               00000001400017D0  [00000001400017CE] 6690                         nop                 [00000001400017D0] 4F8D1452                     lea               r10,[r10+r10*2]  [00000001400017D4] 49FFC2                       inc               r10  [00000001400017D7] EBE7                         jmp               00000001400017C0  [00000001400017D9] 66666690                     nop                 [00000001400017DD] 666690                       nop                 [00000001400017E0] 49D1EA                       shr               r10,1  [00000001400017E3] 75DB                         jne               00000001400017C0  [00000001400017E5] 66666690                     nop                 [00000001400017E9] 66666690                     nop                 [00000001400017ED] 666690                       nop               @Finished :`
Kenavo
---------------------------
Grincheux / Jokaste

#### jj2007

• Member
• Posts: 8963
• Assembler is fun ;-)
##### Re: Optimizing Collatz
« Reply #2 on: November 16, 2017, 11:01:39 PM »
There is a related post here: http://masm32.com/board/index.php?topic=4165.msg44290#msg44290
Your code looks odd - is that GAS syntax? I would strongly suggest to rewrite it for Masm, as nobody here is familiar with GAS.

Welcome to the forum

#### AW

• Member
• Posts: 1657
• Let's Make ASM Great Again!
##### Re: Optimizing Collatz
« Reply #3 on: November 16, 2017, 11:12:38 PM »
fylux,

May be someone here can improve on that. However, people here does not deal much with GAS and the System V ABI.

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #4 on: November 16, 2017, 11:15:53 PM »
Jokaste, thank you very much for the proposals.

However, I have a few doubts.

Nice note about using inc. I will research about the performance difference of those add and inc.

Sorry but I don't know yet what BTx is, therefore I will look for to understand you are proposing because it's true that I only use the bit 1.
EDIT: I found about it, and it looks more suitable but in the guide of Agner fog about instructions in Skylake all the stats are the same for "and" and "bt" but the Reciprocal through put which is 0.25 with and, 0.5 with bt.

About you suggestion of dec and jnz I think that there is a problem (maybe I'm wrong). If you decrement rcx and then you move it to rdx it will change the result since in every iteration I will be decrementing 1.

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #5 on: November 16, 2017, 11:18:44 PM »
jj2007 and aw27, Nice you meet you.

Indeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC. I understand that many people will have problem working with GAS but I thought that for a simple example like this it wouldn't be a problem.

#### jj2007

• Member
• Posts: 8963
• Assembler is fun ;-)
##### Re: Optimizing Collatz
« Reply #6 on: November 16, 2017, 11:34:58 PM »
Indeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC.

What do you get if you use gcc -S -masm=intel test.c ?

#### Jokaste

• Regular Member
• Posts: 47
• Never be pleased, always improve
##### Re: Optimizing Collatz
« Reply #7 on: November 16, 2017, 11:43:28 PM »
FyLux you are right this a reason I wrote an entire code... and tested it. I use PoAsm assembler. It'c a C compiler and a 64 bits assembler too.
Kenavo
---------------------------
Grincheux / Jokaste

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #8 on: November 16, 2017, 11:45:15 PM »
Indeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC.

What do you get if you use gcc -S -masm=intel test.c ?

Good idea. If I use the equivalent command:
Code: [Select]
`g++ -S -masm=intel collatz.cpp -std=c++11 -O3`I get this:
Code: [Select]
`.LFB0: .cfi_startproc mov edx, 1000000 mov eax, 1 .p2align 4,,10 .p2align 3.L2: mov rsi, rdx lea rcx, [rdx+1+rdx*2] shr rsi and edx, 1 cmove rcx, rsi add rax, 1 cmp rcx, 1 mov rdx, rcx jne .L2 rep ret`

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #9 on: November 16, 2017, 11:50:57 PM »
FyLux you are right this a reason I wrote an entire code... and tested it. I use PoAsm assembler. It'c a C compiler and a 64 bits assembler too.

Thank you very much for taking the effort of writing an entire code.  Your code it is more elegant but I still wonder if it is faster. BTW remember that you still have to add a counter to measure the number of iterations.
I don't know if it is better to rely on branches or on cmove. The other change is the BT and I'm not sure yet if it will affect performance. And about the  instruction order I wonder if it is imporant. In my approach I compute both possibilities before the branch so you may think that is less efficient. But I think that if you have to wait until the branch is solved maybe you are loosing time and it's better to precalculate both possiblities. However because this is Out of Order processor maybe it already figures out by itself.

What do you think?  I wish I knew any program in Linux that shows the stalls between instructions.

BTW Maybe now you want to check my code generated in MASM style.

#### AW

• Member
• Posts: 1657
• Let's Make ASM Great Again!
##### Re: Optimizing Collatz
« Reply #10 on: November 16, 2017, 11:53:31 PM »
Let me know how this fares:

Code: [Select]
`;\masm32\bin\ml64  -c -Zp8 coll.asm;\masm32\bin\link /ENTRY:main /SUBSYSTEM:console /MACHINE:X64 coll.objincludelib \masm32\lib64\msvcrt.libprintf PROTO :PTR, :VARARGincludelib \masm32\lib64\kernel32.libExitProcess PROTO :DWORD.datamyNumber dq 07eeeeeeeeeeeeeeehfmt db "Number of iterations: %d",10,0.codemain proc sub rsp, 28h mov rdx,0 mov rax, myNumber@loop: test rax, 1 jz @divBy2 lea rax, QWORD PTR [rax+rax*2+1] jmp short @comp@divBy2: shr rax, 1@comp: inc rdx cmp rax, 1 jne @loop lea rcx, fmt call printf mov ecx, 0 call ExitProcessmain endpend`
Number of iteration:365

#### Jokaste

• Regular Member
• Posts: 47
• Never be pleased, always improve
##### Re: Optimizing Collatz
« Reply #11 on: November 16, 2017, 11:54:26 PM »
Quote
rep ret[/qote][/size]
or
Quote
nop
rep

I read about this some days ago, Amd or Intel gives a macro for REP RET because some assembler don't allow REP prefix before RET. The problem in that case is that i am not sure the correct EPILOG is written.
Kenavo
---------------------------
Grincheux / Jokaste

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #12 on: November 17, 2017, 12:42:44 AM »
Let me know how this fares:

Code: [Select]
`...`
Number of iteration:365

Can you change it to return the number of iterations in the main instead of a print? It's to make it easier to measure. This way maybe I can try to compile it on Linux.
In the worst case since is pretty similar to mine but changing the cmove with branches maybe I can program something equivalent from scratch.

#### fylux

• Regular Member
• Posts: 16
##### Re: Optimizing Collatz
« Reply #13 on: November 17, 2017, 12:43:45 AM »
Quote
rep ret[/qote][/size]
or
Quote
nop
rep

I read about this some days ago, Amd or Intel gives a macro for REP RET because some assembler don't allow REP prefix before RET. The problem in that case is that i am not sure the correct EPILOG is written.

I'm not sure about what do you mean but I think that compilers started to add it because of some AMD issue and now I find it in every code generated by GCC but I think that is useless.

#### AW

• Member
• Posts: 1657
• Let's Make ASM Great Again!
##### Re: Optimizing Collatz
« Reply #14 on: November 17, 2017, 12:54:10 AM »
Can you change it to return the number of iterations in the main instead of a print? It's to make it easier to measure. This way maybe I can try to compile it on Linux.
In the worst case since is pretty similar to mine but changing the cmove with branches maybe I can program something equivalent from scratch.

integer are returned in eax. Now, is your turn.
PS: Conditional move ASM instructions are probably slower than you are expecting or wishing. Last time I tested they were.