News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Optimizing Collatz

Started by fylux, November 16, 2017, 10:32:08 PM

Previous topic - Next topic

nidud

#15
deleted

hutch--

See if you can find an AMD optimisation guide as AMD hardware usually does things differently to Intel hardware and the general optimisations used with an Intel processor may not be optimal for an AMD processor.

fylux

Thank you very much for the test suit nidud!

This are my results:

-- (1000000)
   159518 cycles, rep(300), code( 32) 1.asm: algorithm 1
   153129 cycles, rep(300), code( 32) 2.asm: algorithm 2
   157345 cycles, rep(300), code( 32) 3.asm: algorithm 3
   153764 cycles, rep(300), code( 32) 4.asm: algorithm 4
-- (1000001)
   116474 cycles, rep(300), code( 32) 1.asm: algorithm 1
   116084 cycles, rep(300), code( 32) 2.asm: algorithm 2
   115590 cycles, rep(300), code( 32) 3.asm: algorithm 3
   115997 cycles, rep(300), code( 32) 4.asm: algorithm 4

total [1000000 .. 1000001], 1++
   269213 cycles 2.asm: algorithm 2
   269761 cycles 4.asm: algorithm 4
   272935 cycles 3.asm: algorithm 3
   275992 cycles 1.asm: algorithm 1
hit any key to continue...


Does it means that I can change the code of the asm files and run the same cmd program to test it?

fylux

Quote from: hutch-- on November 17, 2017, 01:16:24 AM
See if you can find an AMD optimisation guide as AMD hardware usually does things differently to Intel hardware and the general optimisations used with an Intel processor may not be optimal for an AMD processor.

Well, we can use the optimization guide of Agner that include both AMD and Intel. However, I decided to start optimizing it for Skylake to make the task simpler.

nidud

#19
deleted

fylux

Thank you very much for the clarification.

I'm trying to do a variation with branches as some users proposed:

.x64
.model flat
.code
L2:
    and    ecx,1
    jz    @EVEN
    lea     rcx,[rcx+rcx*2+1]
    jmp   @ENDIF
@EVEN:
    shr     rcx,1
@ENDIF:
    add     rdx,1
    cmp     rcx,1
    jne     L2
    ret

    end


However this code doesn't work, so I'm trying to figure out the reason.

nidud

#21
deleted

fylux

Good point, I just made it run with "test" but I needed to move the register to eax.


.x64
.model flat
.code

mov rax,rcx
L2:
    test    rax,1
    jz   @EVEN
    lea     rax,[rax+rax*2+1]
    jmp  @ENDIF
@EVEN:
    shr     rax,1
@ENDIF:
    inc     rdx
    cmp     rax,1
    jne     L2
    ret

    end


fylux

Finally I've considered the 3 alternatives:

1º(branch modification)

mov rax,rcx
L2:
    test    rax,1
    jz     @EVEN
    lea     rax,[rax+rax*2+1]
    jmp     @ENDIF
@EVEN:
    shr     rax,1
@ENDIF:
    inc     rdx
    cmp     rax,1
    jne     L2
    ret

    end



2º (slightly modification)

L2:
    mov     r8,rcx
    lea     rax,[rcx+rcx*2+1]
    shr     r8,1
    test    rcx,1
    cmove   rax,r8
    inc     rdx
    cmp     rax,1
    mov     rcx,rax
    jne     L2
    ret

    end


3º (the one that nidud attached)

L2:
    mov     r8,rcx
    lea     rax,[rcx+rcx*2+1]
    shr     r8,1
    and     ecx,1
    cmove   rax,r8
    add     rdx,1
    cmp     rax,1
    mov     rcx,rax
    jne     L2
    ret

    end


If I run the tests my conclusion is that the results are the same. Although in this example you can see small differences it change along test but in conclusion they perform the same.

Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (AVX2)
----------------------------------------------
-- (1000000)
   194733 cycles, rep(300), code( 31) 1.asm: algorithm 1
   182347 cycles, rep(300), code( 35) 2.asm: algorithm 2
   179397 cycles, rep(300), code( 32) 3.asm: algorithm 3
-- (1000001)
   143990 cycles, rep(300), code( 31) 1.asm: algorithm 1
   134991 cycles, rep(300), code( 35) 2.asm: algorithm 2
   180371 cycles, rep(300), code( 32) 3.asm: algorithm 3

total [1000000 .. 1000001], 1++
   317338 cycles 2.asm: algorithm 2
   338723 cycles 1.asm: algorithm 1
   359768 cycles 3.asm: algorithm 3
hit any key to continue...


Therefore I still wonder if the code can be improved.

aw27


Don't jump to immediate conclusions, I am with Hutch on this one (cycles are misleading)
http://masm32.com/board/index.php?topic=6666.msg71484#msg71484

nidud

#25
deleted

nidud

#26
deleted

fylux

Quote from: aw27 on November 17, 2017, 03:01:59 AM

Don't jump to immediate conclusions, I am with Hutch on this one (cycles are misleading)
http://masm32.com/board/index.php?topic=6666.msg71484#msg71484

It's true, to measure time inside the code I will need to use chronos with nanoseconds in C++. I will have to generate it with the compiler and then add the changes that we have talked about in GAS.

fylux

Quote from: nidud on November 17, 2017, 03:08:22 AM
Jumps are usually very slow so if you could do without, even by adding 5-6 instructions, it's usually faster.

Quote from: fylux on November 17, 2017, 02:39:57 AM
Therefore I still wonder if the code can be improved.

always  :biggrin:


    add     rdx,1
    lea     rax,[rcx+rcx*2+1]
    shr     rcx,1
    cmovc   rcx,rax
    cmp     rcx,1
    jne     L2


This modification of the code looks better because you avoid some moves but unfortuntely I get this error:
error: rax = 7 (1) 2.asm
hit any key to continue...


By the way, asmc is an emulator or what is exactly?

And the most important.
It's true what you said about branches but it depends on the instructions. Because I can make something like:
n = (3n+1)*least significant bit + (n/2)*(1-least significant bit)
And it doesn't require branches. However, I would say that I tried it and was slower, but I cannot tell it for sure.

Thanks.

PS: Very interesting you benchmark about instructions. According to it bt should be faster than and for example.

nidud

#29
deleted