Optimizing subroutines in assembly language Agner Fog's optimizations handbook with the loop unrolling technique -

http://www.agner.org/optimize/optimizing_assembly.pdf**12.12 Loop unrolling**A loop that does n repetitions can be replaced by a loop that repeats n / r times and does r calculations for each repetition, where r is the unroll factor. n should preferably be divisible by r. Loop unrolling can be used for the following purposes:

Reducing loop overhead. The loop overhead per calculation is divided by the loop unroll factor r. This is only useful if the loop overhead contributes significantly to the calculation time. There is no reason to unroll a loop if some other bottleneck limits the execution speed. For example, the loop in example 12.6e above cannot benefit from further unrolling.

Vectorization. A loop must be rolled out by r or a multiple of r in order to use vector registers with r elements. The loop in example 12.6e is rolled out by 2 in order to use vectors of two double-precision numbers. If we had used single-precision numbers then we would have rolled out the loop by 4 and used vectors of 4 elements.

Improve branch prediction. The prediction of the loop exit branch can be improved by unrolling the loop so much that the repeat count n / r does not exceed the maximum repeat count that can be predicted on a specific CPU.

Improve caching. If the loop suffers from many data cache misses or cache contentions then it may be advantageous to schedule memory reads and writes in the way that is optimal for a specific processor. This is rarely needed on modern processors. See the optimization manual from the microprocessor vendor for details.

Eliminate integer divisions. If the loop contains an expression where the loop counter i is divided by an integer r or the modulo of i by r is calculated, then the integer division can be avoided by unrolling the loop by r.

Eliminate branch inside loop. If there is a branch or a switch statement inside the loop with a repetitive pattern of period r then this can be eliminated by unrolling the loop by r. For example, if an if-else branch goes either way every second time then this branch can be eliminated by rolling out by 2.

Break loop-carried dependency chain. A loop-carried dependency chain can in some cases be broken up by using multiple accumulators. The unroll factor r is equal to103 the number of accumulators. See example 9.3b page 65.

Reduce dependence of induction variable. If the latency of calculating an induction variable from the value in the previous iteration is so long that it becomes a bottleneck then it may be possible to solve this problem by unrolling by r and calculate each value of the induction variable from the value that is r places behind in the sequence.

Complete unrolling. A loop is completely unrolled when r = n. This eliminates the loop overhead completely. Every expression that is a function of the loop counter can be replaced by constants. Every branch that depends only on the loop counter can be eliminated. See page 114 for examples. There is a problem with loop unrolling when the repeat count n is not divisible by the unroll factor r. There will be a remainder of n modulo r extra calculations that are not done inside the loop. These extra calculations have to be done either before or after the main loop. It can be quite tricky to get the extra calculations right when the repeat count is not divisible by the unroll factor; and it gets particularly tricky if we are using a negative index as in example 12.6d and e. The following example shows the DAXPY algorithm again, this time with single precision and unrolled by 4. In this example n is a variable which may or may not be divisible by 4. The arrays X and Y must be aligned by 16. (The optimization that was specific to the PM processor has been omitted for the sake of clarity).

`; Example 12.7. Unrolled Loop of DAXPY, single precision.`

.data

align 16

SignBitS DD 80000000H ; dword with sign bit set

.code

mov eax, n ; Number of calculations, n

sub eax, 4 ; n - 4

lea esi, [X + eax*4] ; Point to X[n-4]

lea edi, [Y + eax*4] ; Point to Y[n-4]

movss xmm2, DA ; Load DA

xorps xmm2, SignBitS ; Change sign

shufps xmm2, xmm2, 0 ; Get -DA into all four dwords of xmm2

neg eax ; -(n-4)

jg L2 ; Skip main loop if n < 4

L1: ; Main loop rolled out by 4

movaps xmm1, [esi+eax*4] ; Load 4 values from X

mulps xmm1, xmm2 ; Multiply with -DA

addps xmm1, [edi+eax*4] ; Add 4 values from Y

movaps [edi+eax*4],xmm1 ; Store 4 results in Y

add eax, 4 ; i += 4

jle L1 ; Loop as long as <= 0

L2: ; Check for remaining calculations

sub eax, 4 ; = -remainder

jns L4 ; Skip extra loop if remainder = 0

L3: ; Extra loop for up to 3 remaining calculations

movss xmm1, [esi+eax*4+16] ; Load 1 value from X

mulss xmm1, xmm2 ; Multiply with -DA

addss xmm1, [edi+eax*4+16] ; Add 1 value from Y

movss [edi+eax*4+16],xmm1 ; Store 1 result in Y

add eax, 1 ; i += 1

js L3 ; Loop as long as negative

L4:

An alternative solution for an unrolled loop that does calculations on arrays is to extend the arrays with up to r-1 unused spaces and rounding up the repeat count n to the nearest multiple of the unroll factor r. This eliminates the need for calculating the remainder (n mod r) and for the extra loop for the remaining calculations. The unused array elements must be initialized to zero or some other valid floating point value in order to avoid subnormal numbers, NAN, overflow, underflow, or any other condition that can slow down the floating point calculations. If the arrays are of integer type then the only condition you have to avoid is division by zero. Loop unrolling should only be used when there is a reason to do so and a significant gain in speed can be obtained. Excessive loop unrolling should be avoided. The disadvantages of loop unrolling are:

The code becomes bigger and takes more space in the code cache. This can cause code cache misses that cost more than what is gained by the unrolling. Note that the code cache misses are not detected when the loop is tested in isolation.

Many processors have a loop buffer to increase the speed of very small loops. The loop buffer is limited to 20 - 40 instructions or 64 bytes of code, depending on the processor. Loop unrolling is likely to decrease performance if it exceeds the size of the loop buffer. (Processor-specific details are provided in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".)

Some processors have a µop cache of limited size. This µop cache is so valuable that its use should be economized. Unrolled loops take up more space in the µop cache. Note that the µop cache misses are not detected when the loop is tested in isolation.

The need to do extra calculations outside the unrolled loop in case n is not divisible by r makes the code more complicated and clumsy and increases the number of branches.

The unrolled loop may need more registers, e.g. for multiple accumulators.

**12.13 Vector loops using mask registers (AVX512)**