Optimizing Collatz

fylux · November 16, 2017, 10:32:08 PM

Hi everybody!
First of all excuse in advance if this topic is not advanced enough for this section of the forum. Having said that this is the idea.

Collatz conjeture is based on the following idea:
-Take a number
-If it's even divide it by 2
-If it's odd multiply it by 3 and add 1
-Repeat the process until you reach 1

I won't enter into more detail because is not necessary. Now, what I'm trying to do is to find the fastest algorithm to find how many steps it takes to reach 1. The platform where I want to optimize it is Skylake, but I don't think that Skylake will have any advantage compare with other new architectures for this code.

Well, the fastest code that I've found is the following:

Code Select


.L2:
   movq  %rdx, %rsi
   leaq  1(%rdx,%rdx,2), %rcx
   shrq  %rsi
   andl  $1, %edx
   cmove  %rsi, %rcx
   addq  $1, %rax
   cmpq  $1, %rcx
   movq  %rcx, %rdx
   jne  .L2

In this code there are a few tricks like use leaq or conditional move (things that I didn't know that exist a few months ago). However I wonder if this part of the code can be further improved. Because it is fully sequential it cannot be parallelized neither vectorized. My only idea is to look for other instructions with lower latency or to find if there is any stall in the code since for example a conditional move is also a branch.

Thank you in advance.

PS: Find attached the whole asm program.

Jokaste · November 16, 2017, 10:50:40 PM

Quote
-Take a number.
-If it's even divide it by 2
-If it's odd multiply it by 3 and add 1

If it is even bit 0 = 0 else bit 0 = 1
replace

Quoteaddq $1, %rax

with

Code Select

inc %raxAlign label L2
I don't understand this syntax.

Quoteandl $1, %edx

Why do you don't use BTx if you always test the bit 1.

Quotecmpq $1, %rcxmovq %rcx, %rdxjne .L2

A suggestion

Code Select


dec rcx
movq  %rcx, %rdx
jnz L2

My proposal

Code Select



  [00000001400017A7] 41BA0D000000                 mov               r10d,D
  [00000001400017AD] B901000000                   mov               ecx,1
  [00000001400017B2] EB0C                         jmp               00000001400017C0
  [00000001400017B4] 66666690                     nop               
  [00000001400017B8] 66666690                     nop               
  [00000001400017BC] 66666690                     nop               
  [00000001400017C0] 4939CA                       cmp               r10,rcx
  [00000001400017C3] 742B                         je                00000001400017F0
  [00000001400017C5] 490FBAE200                   bt                r10,0
  [00000001400017CA] 7314                         jae               00000001400017E0
  [00000001400017CC] EB02                         jmp               00000001400017D0
  [00000001400017CE] 6690                         nop               
  [00000001400017D0] 4F8D1452                     lea               r10,[r10+r10*2]
  [00000001400017D4] 49FFC2                       inc               r10
  [00000001400017D7] EBE7                         jmp               00000001400017C0
  [00000001400017D9] 66666690                     nop               
  [00000001400017DD] 666690                       nop               
  [00000001400017E0] 49D1EA                       shr               r10,1
  [00000001400017E3] 75DB                         jne               00000001400017C0
  [00000001400017E5] 66666690                     nop               
  [00000001400017E9] 66666690                     nop               
  [00000001400017ED] 666690                       nop               @Finished :

jj2007 · November 16, 2017, 11:01:39 PM

There is a related post here: http://masm32.com/board/index.php?topic=4165.msg44290#msg44290
Your code looks odd - is that GAS syntax? I would strongly suggest to rewrite it for Masm, as nobody here is familiar with GAS.

Welcome to the forum :icon14:

aw27 · November 16, 2017, 11:12:38 PM

fylux,

May be someone here can improve on that. However, people here does not deal much with GAS and the System V ABI.

fylux · November 16, 2017, 11:15:53 PM

Jokaste, thank you very much for the proposals.

However, I have a few doubts.

Nice note about using inc. I will research about the performance difference of those add and inc.

Sorry but I don't know yet what BTx is, therefore I will look for to understand you are proposing because it's true that I only use the bit 1.
EDIT: I found about it, and it looks more suitable but in the guide of Agner fog about instructions in Skylake all the stats are the same for "and" and "bt" but the Reciprocal through put which is 0.25 with and, 0.5 with bt.

About you suggestion of dec and jnz I think that there is a problem (maybe I'm wrong). If you decrement rcx and then you move it to rdx it will change the result since in every iteration I will be decrementing 1.

fylux · November 16, 2017, 11:18:44 PM

jj2007 and aw27, Nice you meet you.

Indeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC. I understand that many people will have problem working with GAS but I thought that for a simple example like this it wouldn't be a problem.

jj2007 · November 16, 2017, 11:34:58 PM

Quote from: fylux on November 16, 2017, 11:18:44 PMIndeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC.

What do you get if you use gcc -S -masm=intel test.c ?

Jokaste · November 16, 2017, 11:43:28 PM

FyLux you are right this a reason I wrote an entire code... and tested it. I use PoAsm assembler. It'c a C compiler and a 64 bits assembler too.

fylux · November 16, 2017, 11:45:15 PM

Quote from: jj2007 on November 16, 2017, 11:34:58 PM
Quote from: fylux on November 16, 2017, 11:18:44 PMIndeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC.

What do you get if you use gcc -S -masm=intel test.c ?

Good idea. If I use the equivalent command:

Code Select

g++ -S -masm=intel collatz.cpp -std=c++11 -O3
I get this:

Code Select


.LFB0:
	.cfi_startproc
	mov	edx, 1000000
	mov	eax, 1
	.p2align 4,,10
	.p2align 3
.L2:
	mov	rsi, rdx
	lea	rcx, [rdx+1+rdx*2]
	shr	rsi
	and	edx, 1
	cmove	rcx, rsi
	add	rax, 1
	cmp	rcx, 1
	mov	rdx, rcx
	jne	.L2
	rep ret

fylux · November 16, 2017, 11:50:57 PM

Quote from: Jokaste on November 16, 2017, 11:43:28 PM
FyLux you are right this a reason I wrote an entire code... and tested it. I use PoAsm assembler. It'c a C compiler and a 64 bits assembler too.

Thank you very much for taking the effort of writing an entire code. Your code it is more elegant but I still wonder if it is faster. BTW remember that you still have to add a counter to measure the number of iterations.
I don't know if it is better to rely on branches or on cmove. The other change is the BT and I'm not sure yet if it will affect performance. And about the instruction order I wonder if it is imporant. In my approach I compute both possibilities before the branch so you may think that is less efficient. But I think that if you have to wait until the branch is solved maybe you are loosing time and it's better to precalculate both possiblities. However because this is Out of Order processor maybe it already figures out by itself.

What do you think? I wish I knew any program in Linux that shows the stalls between instructions.

BTW Maybe now you want to check my code generated in MASM style.

aw27 · November 16, 2017, 11:53:31 PM

Let me know how this fares:

Code Select


;\masm32\bin\ml64  -c -Zp8 coll.asm
;\masm32\bin\link /ENTRY:main /SUBSYSTEM:console /MACHINE:X64 coll.obj

includelib \masm32\lib64\msvcrt.lib
printf PROTO :PTR, :VARARG
includelib \masm32\lib64\kernel32.lib
ExitProcess PROTO :DWORD

.data
myNumber dq 07eeeeeeeeeeeeeeeh
fmt db "Number of iterations: %d",10,0

.code

main proc
	sub rsp, 28h
	mov rdx,0
	mov rax, myNumber
@loop:	
	test rax, 1
	jz @divBy2
	lea rax, QWORD PTR [rax+rax*2+1]
	jmp short @comp
@divBy2:	
	shr rax, 1
@comp:
	inc rdx
	cmp rax, 1
	jne @loop
	
	lea rcx, fmt
	call printf
	
	mov ecx, 0
	call ExitProcess
main endp

end

Number of iteration:365

Jokaste · November 16, 2017, 11:54:26 PM

Quoterep ret[/qote]
or
Quotenop
rep

I read about this some days ago, Amd or Intel gives a macro for REP RET because some assembler don't allow REP prefix before RET. The problem in that case is that i am not sure the correct EPILOG is written.

fylux · November 17, 2017, 12:42:44 AM

Quote from: aw27 on November 16, 2017, 11:53:31 PM
Let me know how this fares:

Code Select Expand
...

Number of iteration:365

Can you change it to return the number of iterations in the main instead of a print? It's to make it easier to measure. This way maybe I can try to compile it on Linux.
In the worst case since is pretty similar to mine but changing the cmove with branches maybe I can program something equivalent from scratch.

fylux · November 17, 2017, 12:43:45 AM

Quote from: Jokaste on November 16, 2017, 11:54:26 PM
Quoterep ret[/qote]
or
Quotenop
rep

I read about this some days ago, Amd or Intel gives a macro for REP RET because some assembler don't allow REP prefix before RET. The problem in that case is that i am not sure the correct EPILOG is written.

I'm not sure about what do you mean but I think that compilers started to add it because of some AMD issue and now I find it in every code generated by GCC but I think that is useless.

aw27 · November 17, 2017, 12:54:10 AM

Quote from: fylux on November 17, 2017, 12:42:44 AM
Can you change it to return the number of iterations in the main instead of a print? It's to make it easier to measure. This way maybe I can try to compile it on Linux.
In the worst case since is pretty similar to mine but changing the cmove with branches maybe I can program something equivalent from scratch.

integer are returned in eax. Now, is your turn.
PS: Conditional move ASM instructions are probably slower than you are expecting or wishing. Last time I tested they were.

The MASM Forum

News:

Optimizing Collatz

fylux

Jokaste

jj2007

aw27

fylux

fylux

jj2007

Jokaste

fylux

fylux

aw27

Jokaste

fylux

fylux

aw27