### Author Topic: Maximum loop  (Read 2783 times)

#### mabdelouahab

• Member
• Posts: 493
##### Maximum loop
« on: September 08, 2018, 09:36:00 PM »
In my PC (Intel Core I5), I wanted to try the maximum possible loop using a registry
32 bit, this loop takes 2600 milliseconds (4 294 967 295 times):
Code: [Select]
` xor eax,eax @@: dec eax jnz @B`
64 bit, I did not try it, but according to the calculation, this loop takes 350 years (18 446 744 065 119 617 025 times):
Code: [Select]
` xor rax,rax @@: dec rax jnz @B`

#### jj2007

• Member
• Posts: 11482
• Assembler is fun ;-)
##### Re: Maximum loop
« Reply #1 on: September 08, 2018, 09:50:37 PM »
My i5 is a bit faster... I wonder why:
Code: [Select]
`Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz1447 ms`
Try running the loop three times. Sometimes it takes the cpu a while to arrive a full speed.

#### mabdelouahab

• Member
• Posts: 493
##### Re: Maximum loop
« Reply #2 on: September 08, 2018, 10:02:57 PM »
Hi JJ
In your device the second loop takes 197 years
My I5 is : Intel(R) Core(TM) i5-4210M CPU @ 1.70GHz 2.40GHz

#### LiaoMi

• Member
• Posts: 877
##### Re: Maximum loop
« Reply #3 on: September 08, 2018, 10:12:07 PM »

The noploop CPU Benchmark http://www.brendangregg.com/blog/2014-04-26/the-noploop-cpu-benchmark.html

What happens if I make a loopunrolling ?! https://en.wikipedia.org/wiki/Loop_unrolling

#### mabdelouahab

• Member
• Posts: 493
##### Re: Maximum loop
« Reply #4 on: September 09, 2018, 02:50:57 AM »

Quote
18344ms for 10 * 32bits reg loop
10875ms for 10 thread of 32bits reg loop

Code: [Select]
`OPTION DOTNAMEOPTION LITERALS:ON    .includelib macro lib_:VARARG FOR lib__,<lib_> INCLUDELIB \masm32\lib64\&lib__&.LIB endm endm .proto macro proc_:VARARG FOR proc__,<proc_> &proc__& proto :vararg endm endm  WKey MACRO  LOCAL lbl  lbl:  call _kbhit     test rax, rax     jz lbl  endm .includelib kernel32,msvcrt .proto ExitProcess,printf,GetTickCount,CreateThread,_kbhit.data g db 13,10,0 f db 0 f2 db 0.code MyThread proc xor eax,eax @@: dec eax jnz @B inc f ret MyThread endp MyThread2 proc xor bx,bx xor ax,ax @@: dec ax jnz @B dec bx jnz @B inc f2 ret MyThread2 endp entry_point proc LOCAL Start_Time:qword LOCAL T_Time:qword ;*************************************************************** invoke GetTickCount MOV Start_Time,RAX ;--------------------------------------------- call MyThread call MyThread call MyThread call MyThread call MyThread call MyThread call MyThread call MyThread call MyThread call MyThread ;--------------------------------------------- invoke GetTickCount SUB RAX,Start_Time MOV T_Time,RAX invoke printf,"%dms for 10 * 32bits reg loop ",T_Time ; ;*************************************************************** ; invoke GetTickCount; MOV Start_Time,RAX; ;---------------------------------------------; call MyThread2; call MyThread2; call MyThread2; call MyThread2; call MyThread2; call MyThread2; call MyThread2; call MyThread2; call MyThread2; call MyThread2; ;---------------------------------------------; invoke GetTickCount; SUB RAX,Start_Time; MOV T_Time,RAX; invoke printf,"%dms for 16bits reg loop %s",T_Time ,addr g ;*************************************************************** invoke printf,"%s",addr g xor ah,ah mov f,ah invoke GetTickCount MOV Start_Time,RAX ;--------------------------------------------- invoke CreateThread,0,0,offset MyThread,0,0,0 invoke CreateThread,0,0,offset MyThread,0,0,0 invoke CreateThread,0,0,offset MyThread,0,0,0 invoke CreateThread,0,0,offset MyThread,0,0,0 invoke CreateThread,0,0,offset MyThread,0,0,0 invoke CreateThread,0,0,offset MyThread,0,0,0 invoke CreateThread,0,0,offset MyThread,0,0,0 invoke CreateThread,0,0,offset MyThread,0,0,0 invoke CreateThread,0,0,offset MyThread,0,0,0 invoke CreateThread,0,0,offset MyThread,0,0,0 @@: cmp f,10 jne @B ;--------------------------------------------- invoke GetTickCount SUB RAX,Start_Time MOV T_Time,RAX invoke printf,"%dms for 10 thread of 32bits reg loop %s",T_Time ,addr g WKey invoke ExitProcess,0 ret entry_point endpend ;Start`

#### LiaoMi

• Member
• Posts: 877
##### Re: Maximum loop
« Reply #5 on: September 09, 2018, 04:59:08 AM »
macro for loop unrolling - http://www.masmforum.com/board/index.php?PHPSESSID=786dd40408172108b65a5a36b09c88c0&topic=14708.0
for loop with constants - http://www.masmforum.com/board/index.php?PHPSESSID=8d46cd4ecb1688be429ab49694ec53e6&topic=9100.0
interesting information from the topic above ..
Code: [Select]
`774 cycles, unrolled by 1392 cycles, unrolled by 2352 cycles, unrolled by 4444 cycles, unrolled by 8668 cycles, unrolled by 16580 cycles, unrolled by 32571 cycles, unrolled by 64577 cycles, unrolled by 128580 cycles, unrolled by 256518 cycles, optimized for no unroll`
Generalized Loop-Unrolling: a Method for Program Speed-Up https://pdfs.semanticscholar.org/2d2c/ae7a36dbeba1cd3a9b35c6517752116bf6c1.pdf

Code: [Select]
` .686.MMX.XMM.x64option casemap:noneoption win64:11option frame:autooption stackbase:rsp;option proc:private.nolist.nocref.codestart: align 16 repct = 16             xor eax,eax @@:           REPEAT repct             sub eax,1             ;dec eax,1             jz @F    ENDM          jmp @B        @@:             ret             end start`

#### aw27

• Guest
##### Re: Maximum loop
« Reply #6 on: September 09, 2018, 05:01:09 AM »
Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz 2.50GHz
14719ms for 10 * 32bits reg loop
4141ms for 10 thread of 32bits reg loop

But the threads are not starting all at the same time. I think they should wait all for an event to start the loop and  start the time counting at that point. This is something I have done in the Dining Philosophers problem, although has nothing to do with this.

#### hutch--

• Member
• Posts: 8369
• Mnemonic Driven API Grinder
##### Re: Maximum loop
« Reply #7 on: September 09, 2018, 05:34:42 AM »
I have yet to see a clear rule on unrolling loops.

774 cycles, unrolled by 1
392 cycles, unrolled by 2
352 cycles, unrolled by 4
444 cycles, unrolled by 8
668 cycles, unrolled by 16
580 cycles, unrolled by 32
571 cycles, unrolled by 64
577 cycles, unrolled by 128
580 cycles, unrolled by 256

518 cycles, optimized for no unroll

This list more or less matches what I have found over years, try unrolling a loop by 2, 4, 8 or 16 and simply clock them. In most instances where the unroll make it faster, its either by 2 or 4 then you start to see it get slower again. It depends on the instructions used, a very short loop has the best chance of reducing loop control overhead where the longer the loop code, the less effective the technique becomes.

If its an algorithm I am interested in getting it faster, I test multiple levels of unroll and generally settle for the shortest of the fastest ones but I have also found that unrolls are less effective on later hardware from the Core2 series upwards. I still find a share of algorithms that always get slower when they are unrolled and strangely enough aligning the start of a loop almost always slows it down. I think some of these effects are a byproduct of the processors getting smarter and not liking workarounds to make certain types of code faster.
hutch at movsd dot com
http://www.masm32.com

#### daydreamer

• Member
• Posts: 1644
• building nextdoor
##### Re: Maximum loop
« Reply #8 on: September 09, 2018, 06:02:03 AM »
I have yet to see a clear rule on unrolling loops.

774 cycles, unrolled by 1
392 cycles, unrolled by 2
352 cycles, unrolled by 4
444 cycles, unrolled by 8
668 cycles, unrolled by 16
580 cycles, unrolled by 32
571 cycles, unrolled by 64
577 cycles, unrolled by 128
580 cycles, unrolled by 256

518 cycles, optimized for no unroll

This list more or less matches what I have found over years, try unrolling a loop by 2, 4, 8 or 16 and simply clock them. In most instances where the unroll make it faster, its either by 2 or 4 then you start to see it get slower again. It depends on the instructions used, a very short loop has the best chance of reducing loop control overhead where the longer the loop code, the less effective the technique becomes.

If its an algorithm I am interested in getting it faster, I test multiple levels of unroll and generally settle for the shortest of the fastest ones but I have also found that unrolls are less effective on later hardware from the Core2 series upwards. I still find a share of algorithms that always get slower when they are unrolled and strangely enough aligning the start of a loop almost always slows it down. I think some of these effects are a byproduct of the processors getting smarter and not liking workarounds to make certain types of code faster.
what about SIMD instructions that starts with 4 integers,0,1,2,3 and Add 4,4,4,4 ,and output 4 integers each loops,so you get 0,1,2,3,4,5,6,7,8...etc?must it not be the fastest and smallest unroll,compared to several integer instructions?
SIMD fan and macro fan
why assembly is fastest is because its switch has no (brakes) breaks
:P
only in 16bit assembly you can get away with "Only words" :P

#### mabdelouahab

• Member
• Posts: 493
##### Re: Maximum loop
« Reply #9 on: September 09, 2018, 06:22:58 AM »
Quote
16828ms for 10 * 32bits reg loop
18390ms for 10 * LiaoMi loop
10953ms for 10 thread of 32bits reg loop

#### Siekmanski

• Member
• Posts: 2363
##### Re: Maximum loop
« Reply #10 on: September 09, 2018, 06:54:43 PM »
Timing loops is a bit mysterious on different systems.
It all depends on the code and data L1 cache level sizes, code and data alignment and cache associativity.

I don't fully grasp the workings of cache associativity and how to use it to gain more speed...
Maybe somebody can explain it to me?  :idea:

To get accurate ( sort of ) cycle timing, it is a good thing to first wake up all the cores, so they run at full speed before you start the loop timings.

Align important loops at code cache line size and try to keep the loop code size within the L1 cache line size to prevent penalties. ( prefetching is not always faster on some architectures )
Do the same with data alignment, align it to the data level size. ( L1 L2 L3 )
On my computer the cache line size is 64 bytes.
If my loop code sizes are within 64 bytes, I dont need to unroll them to gain more speed.  8)

When unrolling your code, code alignment is not important anymore because you are exceeding the cache line size.

Here are some figures from my tests:

Unrolled 1 time:
203 Cycles for Test_Align4
194 Cycles for Test_Align16
163 Cycles for Test_Align64

Unrolled 10 times:
1625 Cycles for Test_Align4
1627 Cycles for Test_Align16
1624 Cycles for Test_Align64

Unrolled 100 times: ; exceeding the code level L1 cache size and getting penalties for it ( thus no ~16300 cycles )
17673 Cycles for Test_Align4
17411 Cycles for Test_Align16
17301 Cycles for Test_Align64

Creative coders use backward thinking techniques as a strategy.

#### LiaoMi

• Member
• Posts: 877
##### Re: Maximum loop
« Reply #11 on: September 09, 2018, 08:54:08 PM »
Optimizing subroutines in assembly language Agner Fog's optimizations handbook with the loop unrolling technique - http://www.agner.org/optimize/optimizing_assembly.pdf

12.12 Loop unrolling
A loop that does n repetitions can be replaced by a loop that repeats n / r times and does r calculations for each repetition, where r is the unroll factor. n should preferably be divisible by r. Loop unrolling can be used for the following purposes:

 Reducing loop overhead. The loop overhead per calculation is divided by the loop unroll factor r. This is only useful if the loop overhead contributes significantly to the calculation time. There is no reason to unroll a loop if some other bottleneck limits the execution speed. For example, the loop in example 12.6e above cannot benefit from further unrolling.

 Vectorization. A loop must be rolled out by r or a multiple of r in order to use vector registers with r elements. The loop in example 12.6e is rolled out by 2 in order to use vectors of two double-precision numbers. If we had used single-precision numbers then we would have rolled out the loop by 4 and used vectors of 4 elements.

 Improve branch prediction. The prediction of the loop exit branch can be improved by unrolling the loop so much that the repeat count n / r does not exceed the maximum repeat count that can be predicted on a specific CPU.

 Improve caching. If the loop suffers from many data cache misses or cache contentions then it may be advantageous to schedule memory reads and writes in the way that is optimal for a specific processor. This is rarely needed on modern processors. See the optimization manual from the microprocessor vendor for details.

 Eliminate integer divisions. If the loop contains an expression where the loop counter i is divided by an integer r or the modulo of i by r is calculated, then the integer division can be avoided by unrolling the loop by r.

 Eliminate branch inside loop. If there is a branch or a switch statement inside the loop with a repetitive pattern of period r then this can be eliminated by unrolling the loop by r. For example, if an if-else branch goes either way every second time then this branch can be eliminated by rolling out by 2.

 Break loop-carried dependency chain. A loop-carried dependency chain can in some cases be broken up by using multiple accumulators. The unroll factor r is equal to103 the number of accumulators. See example 9.3b page 65.

 Reduce dependence of induction variable. If the latency of calculating an induction variable from the value in the previous iteration is so long that it becomes a bottleneck then it may be possible to solve this problem by unrolling by r and calculate each value of the induction variable from the value that is r places behind in the sequence.

 Complete unrolling. A loop is completely unrolled when r = n. This eliminates the loop overhead completely. Every expression that is a function of the loop counter can be replaced by constants. Every branch that depends only on the loop counter can be eliminated. See page 114 for examples. There is a problem with loop unrolling when the repeat count n is not divisible by the unroll factor r. There will be a remainder of n modulo r extra calculations that are not done inside the loop. These extra calculations have to be done either before or after the main loop. It can be quite tricky to get the extra calculations right when the repeat count is not divisible by the unroll factor; and it gets particularly tricky if we are using a negative index as in example 12.6d and e. The following example shows the DAXPY algorithm again, this time with single precision and unrolled by 4. In this example n is a variable which may or may not be divisible by 4. The arrays X and Y must be aligned by 16. (The optimization that was specific to the PM processor has been omitted for the sake of clarity).

Code: [Select]
`; Example 12.7. Unrolled Loop of DAXPY, single precision..dataalign 16SignBitS DD 80000000H ; dword with sign bit set.codemov eax, n ; Number of calculations, nsub eax, 4 ; n - 4lea esi, [X + eax*4] ; Point to X[n-4]lea edi, [Y + eax*4] ; Point to Y[n-4]movss xmm2, DA ; Load DAxorps xmm2, SignBitS ; Change signshufps xmm2, xmm2, 0 ; Get -DA into all four dwords of xmm2neg eax ; -(n-4)jg L2 ; Skip main loop if n < 4L1: ; Main loop rolled out by 4movaps xmm1, [esi+eax*4] ; Load 4 values from Xmulps xmm1, xmm2 ; Multiply with -DAaddps xmm1, [edi+eax*4] ; Add 4 values from Ymovaps [edi+eax*4],xmm1 ; Store 4 results in Yadd eax, 4 ; i += 4jle L1 ; Loop as long as <= 0L2: ; Check for remaining calculationssub eax, 4 ; = -remainderjns L4 ; Skip extra loop if remainder = 0L3: ; Extra loop for up to 3 remaining calculationsmovss xmm1, [esi+eax*4+16] ; Load 1 value from Xmulss xmm1, xmm2 ; Multiply with -DAaddss xmm1, [edi+eax*4+16] ; Add 1 value from Ymovss [edi+eax*4+16],xmm1 ; Store 1 result in Yadd eax, 1 ; i += 1js L3 ; Loop as long as negativeL4:`
An alternative solution for an unrolled loop that does calculations on arrays is to extend the arrays with up to r-1 unused spaces and rounding up the repeat count n to the nearest multiple of the unroll factor r. This eliminates the need for calculating the remainder (n mod r) and for the extra loop for the remaining calculations. The unused array elements must be initialized to zero or some other valid floating point value in order to avoid subnormal numbers, NAN, overflow, underflow, or any other condition that can slow down the floating point calculations. If the arrays are of integer type then the only condition you have to avoid is division by zero. Loop unrolling should only be used when there is a reason to do so and a significant gain in speed can be obtained. Excessive loop unrolling should be avoided. The disadvantages of loop unrolling are:

 The code becomes bigger and takes more space in the code cache. This can cause code cache misses that cost more than what is gained by the unrolling. Note that the code cache misses are not detected when the loop is tested in isolation.

 Many processors have a loop buffer to increase the speed of very small loops. The loop buffer is limited to 20 - 40 instructions or 64 bytes of code, depending on the processor. Loop unrolling is likely to decrease performance if it exceeds the size of the loop buffer. (Processor-specific details are provided in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".)

 Some processors have a µop cache of limited size. This µop cache is so valuable that its use should be economized. Unrolled loops take up more space in the µop cache. Note that the µop cache misses are not detected when the loop is tested in isolation.

 The need to do extra calculations outside the unrolled loop in case n is not divisible by r makes the code more complicated and clumsy and increases the number of branches.

 The unrolled loop may need more registers, e.g. for multiple accumulators.

12.13 Vector loops using mask registers (AVX512)

#### daydreamer

• Member
• Posts: 1644
• building nextdoor
##### Re: Maximum loop
« Reply #12 on: September 09, 2018, 09:32:18 PM »
that's one reason why I love to make SSE ***PS code and suggest to use SIMD for unrolling,my earlier experience with make SIMD and SSE2 macros in old forum made me understand opcodes various sizes,SSE ***PS performs 4 fpu instructions with only 3byte opcodes+eventual adress/shuffle immediate byte
so 100 unrolls x cpu/fpu/SSE scalar opcode size =instead 25 SSE unrolled x3 size
+ you get the usual benefits of use SIMD parallel x4 calculations and better 128bit mov's
SIMD fan and macro fan
why assembly is fastest is because its switch has no (brakes) breaks
:P
only in 16bit assembly you can get away with "Only words" :P