The MASM Forum

General => The Laboratory => Topic started by: fylux on November 16, 2017, 10:32:08 PM

Title: Optimizing Collatz
Post by: fylux on November 16, 2017, 10:32:08 PM
Hi everybody!
First of all excuse in advance if this topic is not advanced enough for this section of the forum. Having said that this is the idea.

Collatz conjeture is based on the following idea:
-Take a number
-If it's even divide it by 2
-If it's odd multiply it by 3 and add 1
-Repeat the process until you reach 1

I won't enter into more detail because is not necessary. Now, what I'm trying to do is to find the fastest algorithm to find how many steps it takes to reach 1. The platform where I want to optimize it is Skylake, but I don't think that Skylake will have any advantage compare with other new architectures for this code.

Well, the fastest code that I've found is the following:

.L2:
   movq  %rdx, %rsi
   leaq  1(%rdx,%rdx,2), %rcx
   shrq  %rsi
   andl  $1, %edx
   cmove  %rsi, %rcx
   addq  $1, %rax
   cmpq  $1, %rcx
   movq  %rcx, %rdx
   jne  .L2


In this code there are a few tricks like use leaq or conditional move (things that I didn't know that exist a few months ago). However I wonder if this part of the code can be further improved. Because it is fully sequential it cannot be parallelized neither vectorized. My only idea is to look for other instructions with lower latency or to find if there is any stall in the code since for example a conditional move is also a branch.

Thank you in advance.

PS: Find attached the whole asm program.
Title: Re: Optimizing Collatz
Post by: Jokaste on November 16, 2017, 10:50:40 PM
Quote
-Take a number.

-If it's even divide it by 2
-If it's odd multiply it by 3 and add 1

If it is even bit 0 = 0 else bit 0 = 1
replace

Quoteaddq  $1, %rax
with
inc  %raxAlign label L2
I don't understand this syntax.
Quoteandl  $1, %edx
Why do you don't use BTx if you always test the bit 1.

Quotecmpq  $1, %rcxmovq  %rcx, %rdxjne  .L2
A suggestion


dec rcx
movq  %rcx, %rdx
jnz L2


My proposal


  [00000001400017A7] 41BA0D000000                 mov               r10d,D
  [00000001400017AD] B901000000                   mov               ecx,1
  [00000001400017B2] EB0C                         jmp               00000001400017C0
  [00000001400017B4] 66666690                     nop               
  [00000001400017B8] 66666690                     nop               
  [00000001400017BC] 66666690                     nop               
  [00000001400017C0] 4939CA                       cmp               r10,rcx
  [00000001400017C3] 742B                         je                00000001400017F0
  [00000001400017C5] 490FBAE200                   bt                r10,0
  [00000001400017CA] 7314                         jae               00000001400017E0
  [00000001400017CC] EB02                         jmp               00000001400017D0
  [00000001400017CE] 6690                         nop               
  [00000001400017D0] 4F8D1452                     lea               r10,[r10+r10*2]
  [00000001400017D4] 49FFC2                       inc               r10
  [00000001400017D7] EBE7                         jmp               00000001400017C0
  [00000001400017D9] 66666690                     nop               
  [00000001400017DD] 666690                       nop               
  [00000001400017E0] 49D1EA                       shr               r10,1
  [00000001400017E3] 75DB                         jne               00000001400017C0
  [00000001400017E5] 66666690                     nop               
  [00000001400017E9] 66666690                     nop               
  [00000001400017ED] 666690                       nop               @Finished :
Title: Re: Optimizing Collatz
Post by: jj2007 on November 16, 2017, 11:01:39 PM
There is a related post here: http://masm32.com/board/index.php?topic=4165.msg44290#msg44290
Your code looks odd - is that GAS syntax? I would strongly suggest to rewrite it for Masm, as nobody here is familiar with GAS.

Welcome to the forum :icon14:
Title: Re: Optimizing Collatz
Post by: aw27 on November 16, 2017, 11:12:38 PM
fylux,

May be someone here can improve on that. However, people here does not deal much with GAS and the System V ABI.
Title: Re: Optimizing Collatz
Post by: fylux on November 16, 2017, 11:15:53 PM
Jokaste, thank you very much for the proposals.

However, I have a few doubts.

Nice note about using inc. I will research about the performance difference of those add and inc.

Sorry but I don't know yet what BTx is, therefore I will look for to understand you are proposing because it's true that I only use the bit 1.
EDIT: I found about it, and it looks more suitable but in the guide of Agner fog about instructions in Skylake all the stats are the same for "and" and "bt" but the Reciprocal through put which is 0.25 with and, 0.5 with bt.

About you suggestion of dec and jnz I think that there is a problem (maybe I'm wrong). If you decrement rcx and then you move it to rdx it will change the result since in every iteration I will be decrementing 1.

Title: Re: Optimizing Collatz
Post by: fylux on November 16, 2017, 11:18:44 PM
jj2007 and aw27, Nice you meet you.

Indeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC. I understand that many people will have problem working with GAS but I thought that for a simple example like this it wouldn't be a problem.
Title: Re: Optimizing Collatz
Post by: jj2007 on November 16, 2017, 11:34:58 PM
Quote from: fylux on November 16, 2017, 11:18:44 PMIndeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC.

What do you get if you use gcc -S -masm=intel test.c ?
Title: Re: Optimizing Collatz
Post by: Jokaste on November 16, 2017, 11:43:28 PM
FyLux you are right this a reason I wrote an entire code... and tested it. I use PoAsm assembler (http://www.smorgasbordet.com/pellesc/). It'c a C compiler and a 64 bits assembler too.
Title: Re: Optimizing Collatz
Post by: fylux on November 16, 2017, 11:45:15 PM
Quote from: jj2007 on November 16, 2017, 11:34:58 PM
Quote from: fylux on November 16, 2017, 11:18:44 PMIndeed this is GAS code, because I don't know enough about how to write ASM code by myself so I just tweak the code generated by GCC.

What do you get if you use gcc -S -masm=intel test.c ?

Good idea. If I use the equivalent command:
g++ -S -masm=intel collatz.cpp -std=c++11 -O3
I get this:

.LFB0:
.cfi_startproc
mov edx, 1000000
mov eax, 1
.p2align 4,,10
.p2align 3
.L2:
mov rsi, rdx
lea rcx, [rdx+1+rdx*2]
shr rsi
and edx, 1
cmove rcx, rsi
add rax, 1
cmp rcx, 1
mov rdx, rcx
jne .L2
rep ret
Title: Re: Optimizing Collatz
Post by: fylux on November 16, 2017, 11:50:57 PM
Quote from: Jokaste on November 16, 2017, 11:43:28 PM
FyLux you are right this a reason I wrote an entire code... and tested it. I use PoAsm assembler (http://www.smorgasbordet.com/pellesc/). It'c a C compiler and a 64 bits assembler too.

Thank you very much for taking the effort of writing an entire code.  Your code it is more elegant but I still wonder if it is faster. BTW remember that you still have to add a counter to measure the number of iterations.
I don't know if it is better to rely on branches or on cmove. The other change is the BT and I'm not sure yet if it will affect performance. And about the  instruction order I wonder if it is imporant. In my approach I compute both possibilities before the branch so you may think that is less efficient. But I think that if you have to wait until the branch is solved maybe you are loosing time and it's better to precalculate both possiblities. However because this is Out of Order processor maybe it already figures out by itself.

What do you think?  I wish I knew any program in Linux that shows the stalls between instructions.

BTW Maybe now you want to check my code generated in MASM style.
Title: Re: Optimizing Collatz
Post by: aw27 on November 16, 2017, 11:53:31 PM
Let me know how this fares:


;\masm32\bin\ml64  -c -Zp8 coll.asm
;\masm32\bin\link /ENTRY:main /SUBSYSTEM:console /MACHINE:X64 coll.obj

includelib \masm32\lib64\msvcrt.lib
printf PROTO :PTR, :VARARG
includelib \masm32\lib64\kernel32.lib
ExitProcess PROTO :DWORD

.data
myNumber dq 07eeeeeeeeeeeeeeeh
fmt db "Number of iterations: %d",10,0

.code

main proc
sub rsp, 28h
mov rdx,0
mov rax, myNumber
@loop:
test rax, 1
jz @divBy2
lea rax, QWORD PTR [rax+rax*2+1]
jmp short @comp
@divBy2:
shr rax, 1
@comp:
inc rdx
cmp rax, 1
jne @loop

lea rcx, fmt
call printf

mov ecx, 0
call ExitProcess
main endp

end


Number of iteration:365
Title: Re: Optimizing Collatz
Post by: Jokaste on November 16, 2017, 11:54:26 PM
Quoterep ret[/qote]
or
Quotenop
rep


I read about this some days ago, Amd or Intel gives a macro for REP RET because some assembler don't allow REP prefix before RET. The problem in that case is that i am not sure the correct EPILOG is written.
Title: Re: Optimizing Collatz
Post by: fylux on November 17, 2017, 12:42:44 AM
Quote from: aw27 on November 16, 2017, 11:53:31 PM
Let me know how this fares:


...


Number of iteration:365

Can you change it to return the number of iterations in the main instead of a print? It's to make it easier to measure. This way maybe I can try to compile it on Linux.
In the worst case since is pretty similar to mine but changing the cmove with branches maybe I can program something equivalent from scratch.
Title: Re: Optimizing Collatz
Post by: fylux on November 17, 2017, 12:43:45 AM
Quote from: Jokaste on November 16, 2017, 11:54:26 PM
Quoterep ret[/qote]
or
Quotenop
rep


I read about this some days ago, Amd or Intel gives a macro for REP RET because some assembler don't allow REP prefix before RET. The problem in that case is that i am not sure the correct EPILOG is written.

I'm not sure about what do you mean but I think that compilers started to add it because of some AMD issue and now I find it in every code generated by GCC but I think that is useless.
Title: Re: Optimizing Collatz
Post by: aw27 on November 17, 2017, 12:54:10 AM
Quote from: fylux on November 17, 2017, 12:42:44 AM
Can you change it to return the number of iterations in the main instead of a print? It's to make it easier to measure. This way maybe I can try to compile it on Linux.
In the worst case since is pretty similar to mine but changing the cmove with branches maybe I can program something equivalent from scratch.

integer are returned in eax. Now, is your turn.
PS: Conditional move ASM instructions are probably slower than you are expecting or wishing. Last time I tested they were.


Title: Re: Optimizing Collatz
Post by: nidud on November 17, 2017, 01:01:58 AM
deleted
Title: Re: Optimizing Collatz
Post by: hutch-- on November 17, 2017, 01:16:24 AM
See if you can find an AMD optimisation guide as AMD hardware usually does things differently to Intel hardware and the general optimisations used with an Intel processor may not be optimal for an AMD processor.
Title: Re: Optimizing Collatz
Post by: fylux on November 17, 2017, 01:20:51 AM
Thank you very much for the test suit nidud!

This are my results:

-- (1000000)
   159518 cycles, rep(300), code( 32) 1.asm: algorithm 1
   153129 cycles, rep(300), code( 32) 2.asm: algorithm 2
   157345 cycles, rep(300), code( 32) 3.asm: algorithm 3
   153764 cycles, rep(300), code( 32) 4.asm: algorithm 4
-- (1000001)
   116474 cycles, rep(300), code( 32) 1.asm: algorithm 1
   116084 cycles, rep(300), code( 32) 2.asm: algorithm 2
   115590 cycles, rep(300), code( 32) 3.asm: algorithm 3
   115997 cycles, rep(300), code( 32) 4.asm: algorithm 4

total [1000000 .. 1000001], 1++
   269213 cycles 2.asm: algorithm 2
   269761 cycles 4.asm: algorithm 4
   272935 cycles 3.asm: algorithm 3
   275992 cycles 1.asm: algorithm 1
hit any key to continue...


Does it means that I can change the code of the asm files and run the same cmd program to test it?
Title: Re: Optimizing Collatz
Post by: fylux on November 17, 2017, 01:25:51 AM
Quote from: hutch-- on November 17, 2017, 01:16:24 AM
See if you can find an AMD optimisation guide as AMD hardware usually does things differently to Intel hardware and the general optimisations used with an Intel processor may not be optimal for an AMD processor.

Well, we can use the optimization guide of Agner that include both AMD and Intel. However, I decided to start optimizing it for Skylake to make the task simpler.
Title: Re: Optimizing Collatz
Post by: nidud on November 17, 2017, 01:37:07 AM
deleted
Title: Re: Optimizing Collatz
Post by: fylux on November 17, 2017, 01:53:04 AM
Thank you very much for the clarification.

I'm trying to do a variation with branches as some users proposed:

.x64
.model flat
.code
L2:
    and    ecx,1
    jz    @EVEN
    lea     rcx,[rcx+rcx*2+1]
    jmp   @ENDIF
@EVEN:
    shr     rcx,1
@ENDIF:
    add     rdx,1
    cmp     rcx,1
    jne     L2
    ret

    end


However this code doesn't work, so I'm trying to figure out the reason.
Title: Re: Optimizing Collatz
Post by: nidud on November 17, 2017, 02:22:32 AM
deleted
Title: Re: Optimizing Collatz
Post by: fylux on November 17, 2017, 02:28:41 AM
Good point, I just made it run with "test" but I needed to move the register to eax.


.x64
.model flat
.code

mov rax,rcx
L2:
    test    rax,1
    jz   @EVEN
    lea     rax,[rax+rax*2+1]
    jmp  @ENDIF
@EVEN:
    shr     rax,1
@ENDIF:
    inc     rdx
    cmp     rax,1
    jne     L2
    ret

    end

Title: Re: Optimizing Collatz
Post by: fylux on November 17, 2017, 02:39:57 AM
Finally I've considered the 3 alternatives:

1º(branch modification)

mov rax,rcx
L2:
    test    rax,1
    jz     @EVEN
    lea     rax,[rax+rax*2+1]
    jmp     @ENDIF
@EVEN:
    shr     rax,1
@ENDIF:
    inc     rdx
    cmp     rax,1
    jne     L2
    ret

    end



2º (slightly modification)

L2:
    mov     r8,rcx
    lea     rax,[rcx+rcx*2+1]
    shr     r8,1
    test    rcx,1
    cmove   rax,r8
    inc     rdx
    cmp     rax,1
    mov     rcx,rax
    jne     L2
    ret

    end


3º (the one that nidud attached)

L2:
    mov     r8,rcx
    lea     rax,[rcx+rcx*2+1]
    shr     r8,1
    and     ecx,1
    cmove   rax,r8
    add     rdx,1
    cmp     rax,1
    mov     rcx,rax
    jne     L2
    ret

    end


If I run the tests my conclusion is that the results are the same. Although in this example you can see small differences it change along test but in conclusion they perform the same.

Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (AVX2)
----------------------------------------------
-- (1000000)
   194733 cycles, rep(300), code( 31) 1.asm: algorithm 1
   182347 cycles, rep(300), code( 35) 2.asm: algorithm 2
   179397 cycles, rep(300), code( 32) 3.asm: algorithm 3
-- (1000001)
   143990 cycles, rep(300), code( 31) 1.asm: algorithm 1
   134991 cycles, rep(300), code( 35) 2.asm: algorithm 2
   180371 cycles, rep(300), code( 32) 3.asm: algorithm 3

total [1000000 .. 1000001], 1++
   317338 cycles 2.asm: algorithm 2
   338723 cycles 1.asm: algorithm 1
   359768 cycles 3.asm: algorithm 3
hit any key to continue...


Therefore I still wonder if the code can be improved.
Title: Re: Optimizing Collatz
Post by: aw27 on November 17, 2017, 03:01:59 AM

Don't jump to immediate conclusions, I am with Hutch on this one (cycles are misleading)
http://masm32.com/board/index.php?topic=6666.msg71484#msg71484
Title: Re: Optimizing Collatz
Post by: nidud on November 17, 2017, 03:08:22 AM
deleted
Title: Re: Optimizing Collatz
Post by: nidud on November 17, 2017, 03:26:49 AM
deleted
Title: Re: Optimizing Collatz
Post by: fylux on November 17, 2017, 03:32:08 AM
Quote from: aw27 on November 17, 2017, 03:01:59 AM

Don't jump to immediate conclusions, I am with Hutch on this one (cycles are misleading)
http://masm32.com/board/index.php?topic=6666.msg71484#msg71484

It's true, to measure time inside the code I will need to use chronos with nanoseconds in C++. I will have to generate it with the compiler and then add the changes that we have talked about in GAS.
Title: Re: Optimizing Collatz
Post by: fylux on November 17, 2017, 03:36:25 AM
Quote from: nidud on November 17, 2017, 03:08:22 AM
Jumps are usually very slow so if you could do without, even by adding 5-6 instructions, it's usually faster.

Quote from: fylux on November 17, 2017, 02:39:57 AM
Therefore I still wonder if the code can be improved.

always  :biggrin:


    add     rdx,1
    lea     rax,[rcx+rcx*2+1]
    shr     rcx,1
    cmovc   rcx,rax
    cmp     rcx,1
    jne     L2


This modification of the code looks better because you avoid some moves but unfortuntely I get this error:
error: rax = 7 (1) 2.asm
hit any key to continue...


By the way, asmc is an emulator or what is exactly?

And the most important.
It's true what you said about branches but it depends on the instructions. Because I can make something like:
n = (3n+1)*least significant bit + (n/2)*(1-least significant bit)
And it doesn't require branches. However, I would say that I tried it and was slower, but I cannot tell it for sure.

Thanks.

PS: Very interesting you benchmark about instructions. According to it bt should be faster than and for example.
Title: Re: Optimizing Collatz
Post by: nidud on November 17, 2017, 03:49:22 AM
deleted
Title: Re: Optimizing Collatz
Post by: aw27 on November 17, 2017, 03:56:56 AM
Code timing estimation should only be done with a... timer. Cache, branch prediction, pipelining, out-of-order execution, may weight much more than instructions per cycle or latency.
Title: Re: Optimizing Collatz
Post by: jj2007 on November 17, 2017, 04:42:38 AM
Quote from: fylux on November 17, 2017, 03:36:25 AMbt should be faster than and for example.

Time it...670 ms for empty loop
1358 ms for test al, 1
1497 ms for bt rax, 0
1373 ms for and al, 1

671 ms for empty loop
1341 ms for test al, 1
1482 ms for bt rax, 0
1358 ms for and al, 1

655 ms for empty loop
1357 ms for test al, 1
1482 ms for bt rax, 0
1373 ms for and al, 1


See 64-bit assembly with RichMasm (http://masm32.com/board/index.php?topic=5314.msg59884#msg59884) if you want to build the source. Note that there is loop overhead included, so bt is really pretty slow on this Intel Core i5...
Title: Re: Optimizing Collatz
Post by: nidud on November 17, 2017, 05:03:36 AM
deleted
Title: Re: Optimizing Collatz
Post by: jj2007 on November 17, 2017, 05:12:29 AM
Interesting. How do you explain that and eax, 1 is a factor 4 slower than test eax, 1? In my tests above they are equally fast.
Title: Re: Optimizing Collatz
Post by: nidud on November 17, 2017, 05:26:58 AM
deleted
Title: Re: Optimizing Collatz
Post by: johnsa on November 21, 2017, 02:02:04 AM
I believe the difference between AND and TEST is exactly that, there is a lot of avoided internal logic with TEST as the register contents is not modified and only the flags, this also has an impact on out of order execution in that the CPU is aware that the instruction won't affect the contents of the register.