News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Optimizing Collatz

Started by fylux, November 16, 2017, 10:32:08 PM

Previous topic - Next topic

fylux

Hi everybody!
First of all excuse in advance if this topic is not advanced enough for this section of the forum. Having said that this is the idea.

Collatz conjeture is based on the following idea:
-Take a number
-If it's even divide it by 2
-If it's odd multiply it by 3 and add 1
-Repeat the process until you reach 1

I won't enter into more detail because is not necessary. Now, what I'm trying to do is to find the fastest algorithm to find how many steps it takes to reach 1. The platform where I want to optimize it is Skylake, but I don't think that Skylake will have any advantage compare with other new architectures for this code.

Well, the fastest code that I've found is the following:

.L2:
   movq  %rdx, %rsi
   leaq  1(%rdx,%rdx,2), %rcx
   shrq  %rsi
   andl  $1, %edx
   cmove  %rsi, %rcx
   addq  $1, %rax
   cmpq  $1, %rcx
   movq  %rcx, %rdx
   jne  .L2


In this code there are a few tricks like use leaq or conditional move (things that I didn't know that exist a few months ago). However I wonder if this part of the code can be further improved. Because it is fully sequential it cannot be parallelized neither vectorized. My only idea is to look for other instructions with lower latency or to find if there is any stall in the code since for example a conditional move is also a branch.

Thank you in advance.

PS: Find attached the whole asm program.

Jokaste

Quote
-Take a number.

-If it's even divide it by 2
-If it's odd multiply it by 3 and add 1

If it is even bit 0 = 0 else bit 0 = 1
replace

Quoteaddq  $1, %rax
with
inc  %raxAlign label L2
I don't understand this syntax.
Quoteandl  $1, %edx
Why do you don't use BTx if you always test the bit 1.

Quotecmpq  $1, %rcxmovq  %rcx, %rdxjne  .L2
A suggestion


dec rcx
movq  %rcx, %rdx
jnz L2


My proposal


  [00000001400017A7] 41BA0D000000                 mov               r10d,D
  [00000001400017AD] B901000000                   mov               ecx,1
  [00000001400017B2] EB0C                         jmp               00000001400017C0
  [00000001400017B4] 66666690                     nop               
  [00000001400017B8] 66666690                     nop               
  [00000001400017BC] 66666690                     nop               
  [00000001400017C0] 4939CA                       cmp               r10,rcx
  [00000001400017C3] 742B                         je                00000001400017F0
  [00000001400017C5] 490FBAE200                   bt                r10,0
  [00000001400017CA] 7314                         jae               00000001400017E0
  [00000001400017CC] EB02                         jmp               00000001400017D0
  [00000001400017CE] 6690                         nop               
  [00000001400017D0] 4F8D1452                     lea               r10,[r10+r10*2]
  [00000001400017D4] 49FFC2                       inc               r10
  [00000001400017D7] EBE7                         jmp               00000001400017C0
  [00000001400017D9] 66666690                     nop               
  [00000001400017DD] 666690                       nop               
  [00000001400017E0] 49D1EA                       shr               r10,1
  [00000001400017E3] 75DB                         jne               00000001400017C0
  [00000001400017E5] 66666690                     nop               
  [00000001400017E9] 66666690                     nop               
  [00000001400017ED] 666690                       nop               @Finished :
Kenavo
---------------------------
Grincheux / Jokaste

jj2007

There is a related post here: http://masm32.com/board/index.php?topic=4165.msg44290#msg44290
Your code looks odd - is that GAS syntax? I would strongly suggest to rewrite it for Masm, as nobody here is familiar with GAS.

Welcome to the forum :icon14:

aw27

fylux,

May be someone here can improve on that. However, people here does not deal much with GAS and the System V ABI.

fylux

Jokaste, thank you very much for the proposals.

However, I have a few doubts.

Nice note about using inc. I will research about the performance difference of those add and inc.

Sorry but I don't know yet what BTx is, therefore I will look for to understand you are proposing because it's true that I only use the bit 1.
EDIT: I found about it, and it looks more suitable but in the guide of Agner fog about instructions in Skylake all the stats are the same for "and" and "bt" but the Reciprocal through put which is 0.25 with and, 0.5 with bt.

About you suggestion of dec and jnz I think that there is a problem (maybe I'm wrong). If you decrement rcx and then you move it to rdx it will change the result since in every iteration I will be decrementing 1.


fylux

jj2007 and aw27, Nice you meet you.

Indeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC. I understand that many people will have problem working with GAS but I thought that for a simple example like this it wouldn't be a problem.

jj2007

Quote from: fylux on November 16, 2017, 11:18:44 PMIndeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC.

What do you get if you use gcc -S -masm=intel test.c ?

Jokaste

FyLux you are right this a reason I wrote an entire code... and tested it. I use PoAsm assembler. It'c a C compiler and a 64 bits assembler too.
Kenavo
---------------------------
Grincheux / Jokaste

fylux

Quote from: jj2007 on November 16, 2017, 11:34:58 PM
Quote from: fylux on November 16, 2017, 11:18:44 PMIndeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC.

What do you get if you use gcc -S -masm=intel test.c ?

Good idea. If I use the equivalent command:
g++ -S -masm=intel collatz.cpp -std=c++11 -O3
I get this:

.LFB0:
.cfi_startproc
mov edx, 1000000
mov eax, 1
.p2align 4,,10
.p2align 3
.L2:
mov rsi, rdx
lea rcx, [rdx+1+rdx*2]
shr rsi
and edx, 1
cmove rcx, rsi
add rax, 1
cmp rcx, 1
mov rdx, rcx
jne .L2
rep ret

fylux

Quote from: Jokaste on November 16, 2017, 11:43:28 PM
FyLux you are right this a reason I wrote an entire code... and tested it. I use PoAsm assembler. It'c a C compiler and a 64 bits assembler too.

Thank you very much for taking the effort of writing an entire code.  Your code it is more elegant but I still wonder if it is faster. BTW remember that you still have to add a counter to measure the number of iterations.
I don't know if it is better to rely on branches or on cmove. The other change is the BT and I'm not sure yet if it will affect performance. And about the  instruction order I wonder if it is imporant. In my approach I compute both possibilities before the branch so you may think that is less efficient. But I think that if you have to wait until the branch is solved maybe you are loosing time and it's better to precalculate both possiblities. However because this is Out of Order processor maybe it already figures out by itself.

What do you think?  I wish I knew any program in Linux that shows the stalls between instructions.

BTW Maybe now you want to check my code generated in MASM style.

aw27

Let me know how this fares:


;\masm32\bin\ml64  -c -Zp8 coll.asm
;\masm32\bin\link /ENTRY:main /SUBSYSTEM:console /MACHINE:X64 coll.obj

includelib \masm32\lib64\msvcrt.lib
printf PROTO :PTR, :VARARG
includelib \masm32\lib64\kernel32.lib
ExitProcess PROTO :DWORD

.data
myNumber dq 07eeeeeeeeeeeeeeeh
fmt db "Number of iterations: %d",10,0

.code

main proc
sub rsp, 28h
mov rdx,0
mov rax, myNumber
@loop:
test rax, 1
jz @divBy2
lea rax, QWORD PTR [rax+rax*2+1]
jmp short @comp
@divBy2:
shr rax, 1
@comp:
inc rdx
cmp rax, 1
jne @loop

lea rcx, fmt
call printf

mov ecx, 0
call ExitProcess
main endp

end


Number of iteration:365

Jokaste

Quoterep ret[/qote]
or
Quotenop
rep


I read about this some days ago, Amd or Intel gives a macro for REP RET because some assembler don't allow REP prefix before RET. The problem in that case is that i am not sure the correct EPILOG is written.
Kenavo
---------------------------
Grincheux / Jokaste

fylux

Quote from: aw27 on November 16, 2017, 11:53:31 PM
Let me know how this fares:


...


Number of iteration:365

Can you change it to return the number of iterations in the main instead of a print? It's to make it easier to measure. This way maybe I can try to compile it on Linux.
In the worst case since is pretty similar to mine but changing the cmove with branches maybe I can program something equivalent from scratch.

fylux

Quote from: Jokaste on November 16, 2017, 11:54:26 PM
Quoterep ret[/qote]
or
Quotenop
rep


I read about this some days ago, Amd or Intel gives a macro for REP RET because some assembler don't allow REP prefix before RET. The problem in that case is that i am not sure the correct EPILOG is written.

I'm not sure about what do you mean but I think that compilers started to add it because of some AMD issue and now I find it in every code generated by GCC but I think that is useless.

aw27

Quote from: fylux on November 17, 2017, 12:42:44 AM
Can you change it to return the number of iterations in the main instead of a print? It's to make it easier to measure. This way maybe I can try to compile it on Linux.
In the worst case since is pretty similar to mine but changing the cmove with branches maybe I can program something equivalent from scratch.

integer are returned in eax. Now, is your turn.
PS: Conditional move ASM instructions are probably slower than you are expecting or wishing. Last time I tested they were.