Author Topic: Maximum loop  (Read 1027 times)

mabdelouahab

  • Member
  • ***
  • Posts: 404
Maximum loop
« on: September 08, 2018, 09:36:00 PM »
In my PC (Intel Core I5), I wanted to try the maximum possible loop using a registry
32 bit, this loop takes 2600 milliseconds (4 294 967 295 times):
Code: [Select]
xor eax,eax
@@: dec eax
jnz @B

64 bit, I did not try it, but according to the calculation, this loop takes 350 years (18 446 744 065 119 617 025 times):
Code: [Select]
xor rax,rax
@@: dec rax
jnz @B

jj2007

  • Member
  • *****
  • Posts: 9514
  • Assembler is fun ;-)
    • MasmBasic
Re: Maximum loop
« Reply #1 on: September 08, 2018, 09:50:37 PM »
My i5 is a bit faster... I wonder why:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
1447 ms

Try running the loop three times. Sometimes it takes the cpu a while to arrive a full speed.

mabdelouahab

  • Member
  • ***
  • Posts: 404
Re: Maximum loop
« Reply #2 on: September 08, 2018, 10:02:57 PM »
Hi JJ
In your device the second loop takes 197 years
My I5 is : Intel(R) Core(TM) i5-4210M CPU @ 1.70GHz 2.40GHz

LiaoMi

  • Member
  • ****
  • Posts: 509
Re: Maximum loop
« Reply #3 on: September 08, 2018, 10:12:07 PM »
 :biggrin:

What if I use multithreading? You can split the task  ;)
The noploop CPU Benchmark http://www.brendangregg.com/blog/2014-04-26/the-noploop-cpu-benchmark.html

What happens if I make a loopunrolling ?! https://en.wikipedia.org/wiki/Loop_unrolling

mabdelouahab

  • Member
  • ***
  • Posts: 404
Re: Maximum loop
« Reply #4 on: September 09, 2018, 02:50:57 AM »
40% profit when using multithreading

Quote
18344ms for 10 * 32bits reg loop
10875ms for 10 thread of 32bits reg loop

Code: [Select]
OPTION DOTNAME
OPTION LITERALS:ON
   
.includelib macro lib_:VARARG
FOR lib__,<lib_>
INCLUDELIB \masm32\lib64\&lib__&.LIB
endm
endm
.proto macro proc_:VARARG
FOR proc__,<proc_>
&proc__& proto :vararg
endm
endm
  WKey MACRO
  LOCAL lbl
  lbl:
  call _kbhit
    test rax, rax
    jz lbl
  endm
.includelib kernel32,msvcrt
.proto ExitProcess,printf,GetTickCount,CreateThread,_kbhit
.data
g db 13,10,0
f db 0
f2 db 0
.code

MyThread proc
xor eax,eax
@@: dec eax
jnz @B
inc f
ret
MyThread endp
MyThread2 proc
xor bx,bx
xor ax,ax
@@: dec ax
jnz @B
dec bx
jnz @B
inc f2
ret
MyThread2 endp


entry_point proc
LOCAL Start_Time:qword
LOCAL T_Time:qword
;***************************************************************
invoke GetTickCount
MOV Start_Time,RAX
;---------------------------------------------
call MyThread
call MyThread
call MyThread
call MyThread
call MyThread
call MyThread
call MyThread
call MyThread
call MyThread
call MyThread
;---------------------------------------------
invoke GetTickCount
SUB RAX,Start_Time
MOV T_Time,RAX
invoke printf,"%dms for 10 * 32bits reg loop ",T_Time
; ;***************************************************************
; invoke GetTickCount
; MOV Start_Time,RAX
; ;---------------------------------------------
; call MyThread2
; call MyThread2
; call MyThread2
; call MyThread2
; call MyThread2
; call MyThread2
; call MyThread2
; call MyThread2
; call MyThread2
; call MyThread2
; ;---------------------------------------------
; invoke GetTickCount
; SUB RAX,Start_Time
; MOV T_Time,RAX
; invoke printf,"%dms for 16bits reg loop %s",T_Time ,addr g
;***************************************************************
invoke printf,"%s",addr g
xor ah,ah
mov f,ah
invoke GetTickCount
MOV Start_Time,RAX
;---------------------------------------------

invoke CreateThread,0,0,offset MyThread,0,0,0
invoke CreateThread,0,0,offset MyThread,0,0,0
invoke CreateThread,0,0,offset MyThread,0,0,0
invoke CreateThread,0,0,offset MyThread,0,0,0
invoke CreateThread,0,0,offset MyThread,0,0,0
invoke CreateThread,0,0,offset MyThread,0,0,0
invoke CreateThread,0,0,offset MyThread,0,0,0
invoke CreateThread,0,0,offset MyThread,0,0,0
invoke CreateThread,0,0,offset MyThread,0,0,0
invoke CreateThread,0,0,offset MyThread,0,0,0

@@: cmp f,10
jne @B
;---------------------------------------------
invoke GetTickCount
SUB RAX,Start_Time
MOV T_Time,RAX
invoke printf,"%dms for 10 thread of 32bits reg loop %s",T_Time ,addr g

WKey
invoke ExitProcess,0
ret
entry_point endp
end ;Start


LiaoMi

  • Member
  • ****
  • Posts: 509
Re: Maximum loop
« Reply #5 on: September 09, 2018, 04:59:08 AM »
macro for loop unrolling - http://www.masmforum.com/board/index.php?PHPSESSID=786dd40408172108b65a5a36b09c88c0&topic=14708.0
for loop with constants - http://www.masmforum.com/board/index.php?PHPSESSID=8d46cd4ecb1688be429ab49694ec53e6&topic=9100.0
interesting information from the topic above ..
Code: [Select]
774 cycles, unrolled by 1
392 cycles, unrolled by 2
352 cycles, unrolled by 4
444 cycles, unrolled by 8
668 cycles, unrolled by 16
580 cycles, unrolled by 32
571 cycles, unrolled by 64
577 cycles, unrolled by 128
580 cycles, unrolled by 256

518 cycles, optimized for no unroll

Generalized Loop-Unrolling: a Method for Program Speed-Up https://pdfs.semanticscholar.org/2d2c/ae7a36dbeba1cd3a9b35c6517752116bf6c1.pdf

Code: [Select]
.686
.MMX
.XMM
.x64

option casemap:none
option win64:11
option frame:auto
option stackbase:rsp
;option proc:private

.nolist
.nocref

.code
start:

align 16
repct = 16

             xor eax,eax
 @@:
           REPEAT repct
             sub eax,1
             ;dec eax,1
             jz @F
   ENDM
          jmp @B
        @@:
             ret
             
end start


AW

  • Member
  • *****
  • Posts: 2100
  • Let's Make ASM Great Again!
Re: Maximum loop
« Reply #6 on: September 09, 2018, 05:01:09 AM »
Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz 2.50GHz
14719ms for 10 * 32bits reg loop
4141ms for 10 thread of 32bits reg loop

But the threads are not starting all at the same time. I think they should wait all for an event to start the loop and  start the time counting at that point. This is something I have done in the Dining Philosophers problem, although has nothing to do with this.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 6391
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Maximum loop
« Reply #7 on: September 09, 2018, 05:34:42 AM »
I have yet to see a clear rule on unrolling loops.

774 cycles, unrolled by 1
392 cycles, unrolled by 2
352 cycles, unrolled by 4
444 cycles, unrolled by 8
668 cycles, unrolled by 16
580 cycles, unrolled by 32
571 cycles, unrolled by 64
577 cycles, unrolled by 128
580 cycles, unrolled by 256

518 cycles, optimized for no unroll

This list more or less matches what I have found over years, try unrolling a loop by 2, 4, 8 or 16 and simply clock them. In most instances where the unroll make it faster, its either by 2 or 4 then you start to see it get slower again. It depends on the instructions used, a very short loop has the best chance of reducing loop control overhead where the longer the loop code, the less effective the technique becomes.

If its an algorithm I am interested in getting it faster, I test multiple levels of unroll and generally settle for the shortest of the fastest ones but I have also found that unrolls are less effective on later hardware from the Core2 series upwards. I still find a share of algorithms that always get slower when they are unrolled and strangely enough aligning the start of a loop almost always slows it down. I think some of these effects are a byproduct of the processors getting smarter and not liking workarounds to make certain types of code faster.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

daydreamer

  • Member
  • ****
  • Posts: 831
  • watch Chebyshev on the backside of the Moon
Re: Maximum loop
« Reply #8 on: September 09, 2018, 06:02:03 AM »
I have yet to see a clear rule on unrolling loops.

774 cycles, unrolled by 1
392 cycles, unrolled by 2
352 cycles, unrolled by 4
444 cycles, unrolled by 8
668 cycles, unrolled by 16
580 cycles, unrolled by 32
571 cycles, unrolled by 64
577 cycles, unrolled by 128
580 cycles, unrolled by 256

518 cycles, optimized for no unroll

This list more or less matches what I have found over years, try unrolling a loop by 2, 4, 8 or 16 and simply clock them. In most instances where the unroll make it faster, its either by 2 or 4 then you start to see it get slower again. It depends on the instructions used, a very short loop has the best chance of reducing loop control overhead where the longer the loop code, the less effective the technique becomes.

If its an algorithm I am interested in getting it faster, I test multiple levels of unroll and generally settle for the shortest of the fastest ones but I have also found that unrolls are less effective on later hardware from the Core2 series upwards. I still find a share of algorithms that always get slower when they are unrolled and strangely enough aligning the start of a loop almost always slows it down. I think some of these effects are a byproduct of the processors getting smarter and not liking workarounds to make certain types of code faster.
what about SIMD instructions that starts with 4 integers,0,1,2,3 and Add 4,4,4,4 ,and output 4 integers each loops,so you get 0,1,2,3,4,5,6,7,8...etc?must it not be the fastest and smallest unroll,compared to several integer instructions?
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*

mabdelouahab

  • Member
  • ***
  • Posts: 404
Re: Maximum loop
« Reply #9 on: September 09, 2018, 06:22:58 AM »
Quote
16828ms for 10 * 32bits reg loop
18390ms for 10 * LiaoMi loop
10953ms for 10 thread of 32bits reg loop

Siekmanski

  • Member
  • *****
  • Posts: 1830
Re: Maximum loop
« Reply #10 on: September 09, 2018, 06:54:43 PM »
Timing loops is a bit mysterious on different systems.
It all depends on the code and data L1 cache level sizes, code and data alignment and cache associativity.

I don't fully grasp the workings of cache associativity and how to use it to gain more speed...
Maybe somebody can explain it to me?  :idea:

To get accurate ( sort of ) cycle timing, it is a good thing to first wake up all the cores, so they run at full speed before you start the loop timings.

Align important loops at code cache line size and try to keep the loop code size within the L1 cache line size to prevent penalties. ( prefetching is not always faster on some architectures )
Do the same with data alignment, align it to the data level size. ( L1 L2 L3 )
On my computer the cache line size is 64 bytes.
If my loop code sizes are within 64 bytes, I dont need to unroll them to gain more speed.  8)

When unrolling your code, code alignment is not important anymore because you are exceeding the cache line size.

Here are some figures from my tests:

Unrolled 1 time:
203 Cycles for Test_Align4
194 Cycles for Test_Align16
163 Cycles for Test_Align64

Unrolled 10 times:
1625 Cycles for Test_Align4
1627 Cycles for Test_Align16
1624 Cycles for Test_Align64

Unrolled 100 times: ; exceeding the code level L1 cache size and getting penalties for it ( thus no ~16300 cycles )
17673 Cycles for Test_Align4
17411 Cycles for Test_Align16
17301 Cycles for Test_Align64

Creative coders use backward thinking techniques as a strategy.

LiaoMi

  • Member
  • ****
  • Posts: 509
Re: Maximum loop
« Reply #11 on: September 09, 2018, 08:54:08 PM »
Optimizing subroutines in assembly language Agner Fog's optimizations handbook with the loop unrolling technique - http://www.agner.org/optimize/optimizing_assembly.pdf

12.12 Loop unrolling
A loop that does n repetitions can be replaced by a loop that repeats n / r times and does r calculations for each repetition, where r is the unroll factor. n should preferably be divisible by r. Loop unrolling can be used for the following purposes:

 Reducing loop overhead. The loop overhead per calculation is divided by the loop unroll factor r. This is only useful if the loop overhead contributes significantly to the calculation time. There is no reason to unroll a loop if some other bottleneck limits the execution speed. For example, the loop in example 12.6e above cannot benefit from further unrolling.

 Vectorization. A loop must be rolled out by r or a multiple of r in order to use vector registers with r elements. The loop in example 12.6e is rolled out by 2 in order to use vectors of two double-precision numbers. If we had used single-precision numbers then we would have rolled out the loop by 4 and used vectors of 4 elements.

 Improve branch prediction. The prediction of the loop exit branch can be improved by unrolling the loop so much that the repeat count n / r does not exceed the maximum repeat count that can be predicted on a specific CPU.

 Improve caching. If the loop suffers from many data cache misses or cache contentions then it may be advantageous to schedule memory reads and writes in the way that is optimal for a specific processor. This is rarely needed on modern processors. See the optimization manual from the microprocessor vendor for details.

 Eliminate integer divisions. If the loop contains an expression where the loop counter i is divided by an integer r or the modulo of i by r is calculated, then the integer division can be avoided by unrolling the loop by r.

 Eliminate branch inside loop. If there is a branch or a switch statement inside the loop with a repetitive pattern of period r then this can be eliminated by unrolling the loop by r. For example, if an if-else branch goes either way every second time then this branch can be eliminated by rolling out by 2.

 Break loop-carried dependency chain. A loop-carried dependency chain can in some cases be broken up by using multiple accumulators. The unroll factor r is equal to103 the number of accumulators. See example 9.3b page 65.

 Reduce dependence of induction variable. If the latency of calculating an induction variable from the value in the previous iteration is so long that it becomes a bottleneck then it may be possible to solve this problem by unrolling by r and calculate each value of the induction variable from the value that is r places behind in the sequence.

 Complete unrolling. A loop is completely unrolled when r = n. This eliminates the loop overhead completely. Every expression that is a function of the loop counter can be replaced by constants. Every branch that depends only on the loop counter can be eliminated. See page 114 for examples. There is a problem with loop unrolling when the repeat count n is not divisible by the unroll factor r. There will be a remainder of n modulo r extra calculations that are not done inside the loop. These extra calculations have to be done either before or after the main loop. It can be quite tricky to get the extra calculations right when the repeat count is not divisible by the unroll factor; and it gets particularly tricky if we are using a negative index as in example 12.6d and e. The following example shows the DAXPY algorithm again, this time with single precision and unrolled by 4. In this example n is a variable which may or may not be divisible by 4. The arrays X and Y must be aligned by 16. (The optimization that was specific to the PM processor has been omitted for the sake of clarity).

Code: [Select]
; Example 12.7. Unrolled Loop of DAXPY, single precision.
.data
align 16
SignBitS DD 80000000H ; dword with sign bit set
.code
mov eax, n ; Number of calculations, n
sub eax, 4 ; n - 4
lea esi, [X + eax*4] ; Point to X[n-4]
lea edi, [Y + eax*4] ; Point to Y[n-4]
movss xmm2, DA ; Load DA
xorps xmm2, SignBitS ; Change sign
shufps xmm2, xmm2, 0 ; Get -DA into all four dwords of xmm2
neg eax ; -(n-4)
jg L2 ; Skip main loop if n < 4
L1: ; Main loop rolled out by 4
movaps xmm1, [esi+eax*4] ; Load 4 values from X
mulps xmm1, xmm2 ; Multiply with -DA
addps xmm1, [edi+eax*4] ; Add 4 values from Y
movaps [edi+eax*4],xmm1 ; Store 4 results in Y
add eax, 4 ; i += 4
jle L1 ; Loop as long as <= 0
L2: ; Check for remaining calculations
sub eax, 4 ; = -remainder
jns L4 ; Skip extra loop if remainder = 0
L3: ; Extra loop for up to 3 remaining calculations
movss xmm1, [esi+eax*4+16] ; Load 1 value from X
mulss xmm1, xmm2 ; Multiply with -DA
addss xmm1, [edi+eax*4+16] ; Add 1 value from Y
movss [edi+eax*4+16],xmm1 ; Store 1 result in Y
add eax, 1 ; i += 1
js L3 ; Loop as long as negative
L4:

An alternative solution for an unrolled loop that does calculations on arrays is to extend the arrays with up to r-1 unused spaces and rounding up the repeat count n to the nearest multiple of the unroll factor r. This eliminates the need for calculating the remainder (n mod r) and for the extra loop for the remaining calculations. The unused array elements must be initialized to zero or some other valid floating point value in order to avoid subnormal numbers, NAN, overflow, underflow, or any other condition that can slow down the floating point calculations. If the arrays are of integer type then the only condition you have to avoid is division by zero. Loop unrolling should only be used when there is a reason to do so and a significant gain in speed can be obtained. Excessive loop unrolling should be avoided. The disadvantages of loop unrolling are:

 The code becomes bigger and takes more space in the code cache. This can cause code cache misses that cost more than what is gained by the unrolling. Note that the code cache misses are not detected when the loop is tested in isolation.

 Many processors have a loop buffer to increase the speed of very small loops. The loop buffer is limited to 20 - 40 instructions or 64 bytes of code, depending on the processor. Loop unrolling is likely to decrease performance if it exceeds the size of the loop buffer. (Processor-specific details are provided in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".)

 Some processors have a µop cache of limited size. This µop cache is so valuable that its use should be economized. Unrolled loops take up more space in the µop cache. Note that the µop cache misses are not detected when the loop is tested in isolation.

 The need to do extra calculations outside the unrolled loop in case n is not divisible by r makes the code more complicated and clumsy and increases the number of branches.

 The unrolled loop may need more registers, e.g. for multiple accumulators.

12.13 Vector loops using mask registers (AVX512)

daydreamer

  • Member
  • ****
  • Posts: 831
  • watch Chebyshev on the backside of the Moon
Re: Maximum loop
« Reply #12 on: September 09, 2018, 09:32:18 PM »
that's one reason why I love to make SSE ***PS code and suggest to use SIMD for unrolling,my earlier experience with make SIMD and SSE2 macros in old forum made me understand opcodes various sizes,SSE ***PS performs 4 fpu instructions with only 3byte opcodes+eventual adress/shuffle immediate byte
so 100 unrolls x cpu/fpu/SSE scalar opcode size =instead 25 SSE unrolled x3 size
+ you get the usual benefits of use SIMD parallel x4 calculations and better 128bit mov's
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*