Assembly Optimization Tips

by

Mark Larson

The most important thing to remember is to TIME your code. Trying different tricks might or might not speed up your code. So it is very important to time your code to see if you do get a speedup as you try each trick.

Beginner

Freeing up all 8 CPU registers for use in your code. This also demonstrates how to save a GP register in an MMX register

    push ebx 
    push esi 
    push edi 
    push ebp                    ; has to be done before changing ESP 
 
    ;load up ESI and EDI and your other registers with the values passed
    ;in on the stack. This has to be done before freeing up ESP. 
 
    movd mm0,esp                ; no pushing/popping past this point.
    xor  ebx,ebx                ; So how do you save?  A variable. 
    mov  esp,5 
 
inner_loop: 
    mov   [eax_save],eax        ; eax_save is a global variable. Can't be
    movzx eax,word ptr [fred_2] ; local because that requires EBP to point to it. 
    add   ebx,eax    
    mov   eax,[eax_save]        ; restore eax 
    dec   esp                   ; ESP is the loop counter (set to 5 above)
    jnz   inner_loop

    movd  esp,mm0               ; has to be done before you do the POPs 

    pop   ebp 
    pop   edi 
    pop   esi 
    pop   ebx 
    ret

Maximize register usage.
Most high level compilers generate a lot of memory accesses to variables. I usually get around that by trying to keep most of them in registers. Having all 8 cpu registers free can really come in handy.
Complex instructions
Avoid complex instructions (lods, stos, movs, cmps, scas, loop, xadd, enter, leave). Complex instructions are instructions that do multiple things. For instance, stosb writes a byte to memory and also increments EDI. Intel stopped making these fast with the original Pentium because it was trying to make the chip more RISC-like. The one exception is the string instructions used with a REP prefix, which are still fast.
Don't use INC/DEC on P4
On the P4 use ADD/SUB in place of INC/DEC. Generally it is faster. ADD/SUB runs in 0.5 cycles. INC/DEC takes 1 cycle.
Rotating
Avoid rotate by a register or rotate by an immediate value of anything but a 1.
Eliminate unnecessary compare instructions
Eliminate unnecessary compare instructions by doing the appropriate conditional jump instruction based on the flags that are already set from a previous arithmetic instruction.
```
        dec     ecx
        cmp     ecx,0
        jnz     loop_again

;gets changed to
        dec     ecx
        jnz     loop_again
```
LEA is still really cool, except for on the P4 it tends to be slow.
You can perform multiple math operations in one instruction, and it does not affect the flags register, so you can put it between an instruction that sets the flags and the conditional jump that tests them.
```
top_of_loop: 

    dec   eax 
    lea   edx,[edx*4+3]     ; multiply by 4 and add 3. Does not affect flags
    jnz   top_of_loop       ; so the next instruction doesn't get hosed.
```

ADC and SBB.

Most compilers don't make good use of ADC and SBB. You can get good speedups with them when adding two 64-bit numbers together or adding big numbers together. Keep in mind that ADC and SBB are slow on the P4. As a workaround you can do the adding or subtracting in MMX registers with "paddq" and "psubq". You need a processor that supports SSE2 for that, because PADDQ and PSUBQ were added with SSE2 even when they operate on MMX registers.

    add    eax,[fred] 
    adc    edx,[fred+4] 

    ; the above 2 statements add the 64-bit value at fred to EDX:EAX.
    ; The 2 statements below do the same 64-bit add with the running
    ; total kept in MM0 instead of EDX:EAX.

    movq   mm1,[fred]       ; Get 64-bit value in MM1 
    paddq  mm0,mm1          ; This is an unoptimized way to do it. You would
                            ; really pre-read MM1 a loop in advance.
                            ; I did it this way for ease of understanding.

ROL, ROR, RCL, and RCR and BSWAP.
It is a cool trick to switch from Big Endian to Little Endian using BSWAP. You can also use it for temporary storage of a 16-bit or 8-bit value in the upper half of a register. Likewise you can use ROL and ROR for storing 8-bit and 16-bit values. It's a way to get more "registers". If all you are dealing with are 16-bit values, you can turn your 8 32-bit registers into 16 16-bit registers, which gives you a lot more registers to use. Only the value currently sitting in the low half is directly usable, so you swap before each access. RCL and RCR can also easily be used for counting the number of bits that are set in a register. Keep in mind that ROL, ROR, RCL, RCR and BSWAP are all slow on the P4. The rotate instructions are about twice as fast as BSWAP, so if you have to use one or the other on the P4, use the rotates.
```
    xor   edx,edx           ; set both 16-bit registers to 0 
    mov   dx,234            ; set the first 16-bit register to 234 
    bswap edx               ; swap it so the second one is ready 
    mov   dx,345            ; set the second 16-bit register to 345 
    bswap edx               ; swap to the first one 
    add   dx,5              ; add 5 to the first one 
    bswap edx               ; swap to the second one 
    add   dx,7              ; add 7 to it 
```
String instructions.
Most compilers don't make good use of the string instructions ( scas, cmps, stos, movs, and lods). So checking to see if that is faster than some library routine can be a win. For instance I was really surprised when I looked at strlen() in VC++. In the radix40 code it ran in 416 cycles for a 100 byte string!!! I thought that was absurdly slow.
Multiply to divide.
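If you have a full 32-bit number and you need to divide by a value known in advance, you can multiply by its reciprocal scaled by 2^32 and take the top 32-bit half of the 64-bit product (EDX after a MUL) as the result. This is faster because multiplication is faster than division. Depending on the divisor, a small shift or correction may be needed afterwards to get an exact quotient. (Thanks to pdixon for the tip.)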
Dividing by a constant.
There's some nice information on how to divide by a constant in Agner Fog's pentopt.pdf document. I wrote a program that you pass in the number you want to divide by, and it will print out the assembler code sequence. I will dig it up later and post it. Here is a link to Agner's document. Agner's Pentopt PDF (http://www.agner.org/assem/)
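As a rough illustration of the multiply-instead-of-divide idea, here is a minimal C sketch for one specific divisor, 10. The function name `div10` is made up for this example. The constant 0CCCCCCCDh is the reciprocal of 10 scaled by 2^35 and rounded up, and the shift by 35 (32 to take the top half, plus 3 more) is the correction step for this divisor. Other divisors need their own constant and shift.

```c
#include <stdint.h>

/* Divide an unsigned 32-bit value by 10 without a DIV instruction.
   0xCCCCCCCD is ceil(2^35 / 10). The product is 64 bits wide, and
   shifting it right by 35 leaves the quotient. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

In assembler this corresponds to a MUL by the constant followed by a SHR of EDX by 3. As with everything else here, time it against a plain DIV on your target processor.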

Unrolling.

This is a guideline. Unrolling falls in the General Optimization category, but I wanted to add a footnote. I always set up my unrolling with a macro that unrolls by an EQUATE value. That way you can easily try different unroll amounts and see which is fastest. You want the unrolled loop to fit in the L1 code cache (or trace cache).

UNROLL_AMT       equ   16   ; # of times to unroll the loop
UNROLL_NUM_BYTES equ    4   ; # of bytes handled in 1 loop iteration

        mov     ecx,1024
looper:
offset2 = 0
REPEAT UNROLL_AMT
        add     eax,[edi+offset2]
offset2 = offset2 + UNROLL_NUM_BYTES
ENDM
        add     edi,UNROLL_AMT * UNROLL_NUM_BYTES   ; we dealt with 16*4 bytes.
        sub     ecx,UNROLL_AMT  ; subtract from loop counter the # of loops we unrolled.
        jnz     looper

MOVZX.
Use MOVZX to avoid partial register stalls. I use MOVZX a lot. A lot of people XOR the full 32-bit register first. But MOVZX does the equivalent thing without having to have an extra XOR instruction. Also you had to do the XOR enough in advance to give it time to complete. With MOVZX you don't have to worry about that.

Using MOVZX to avoid a SHIFT and AND instruction

I ran across this bit of C code I was trying to speed up using assembler. The_array is a dword array. The code extracts a different byte from a dword in the array depending on which pass this is over the data. "Pass" is a variable that goes from 0 to 3, one value for each byte in a dword.

        unsigned char c = ((the_array[i])>>(Pass<<3)) & 0xFF;

; I got rid of the "pass" variable by unrolling the loop 4 times.
; So I had 4 of these, each one separated by lots of C code.
        unsigned char c = ((the_array[i])>>0) & 0xFF;
        unsigned char c = ((the_array[i])>>8) & 0xFF;
        unsigned char c = ((the_array[i])>>16) & 0xFF;
        unsigned char c = ((the_array[i])>>24) & 0xFF;

What if I can get rid of the SHIFT and the AND using assembler? That would save me 2 instructions. Not to mention the fact that the P4 is very slow when doing SHIFT instructions (4 cycles!!!). So try to avoid shifts where possible. Take just the second-to-last line, the one that shifts right by 16, as our example.

; esi points to the_array

        mov     eax,[esi]
        shr     eax,16
        and     eax,0FFh

; So how do we change that to get rid of the AND and SHR?
; We do a MOVZx with the 3rd byte in the dword.

        movzx   eax,byte ptr [esi+2]            ;unsigned char c = ((the_array[i])>>16) & 0xFF;
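To see that the byte load and the shift-and-mask really give the same answer, here is a small C sketch that computes both. The function names are invented for this example, and the byte-load version assumes a little-endian machine such as x86, which is also what the MOVZX trick depends on.

```c
#include <stdint.h>
#include <string.h>

/* Shift-and-mask version: what the original C expression computes. */
unsigned char byte_by_shift(uint32_t value, int pass)
{
    return (unsigned char)((value >> (pass << 3)) & 0xFF);
}

/* Byte-load version: read byte number `pass` of the dword straight
   from memory, like MOVZX with [esi+pass].
   Assumes little-endian byte order (true on x86). */
unsigned char byte_by_load(uint32_t value, int pass)
{
    unsigned char bytes[4];
    memcpy(bytes, &value, sizeof bytes);
    return bytes[pass];
}
```

On x86 the two functions agree for every pass from 0 to 3, which is why the SHR and AND can be dropped.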

Align, align, align.
It is really important to align both your code and data to get a good speed up. I generally align code on 4 byte boundaries. For data I align 2 byte data on 2 byte boundaries, 4 byte data on 4 byte boundaries, 8 byte data on 8 byte boundaries, 16 byte data on 16 byte boundaries. In general, if your SSE or SSE2 data is not on a 16-byte boundary, the aligned move instructions (MOVAPS, MOVDQA) will raise an exception. You can align your data in VC++ if you have the processor pack. They added support for both static data and dynamic memory. For static data you use __declspec(align(4)), which aligns on a 4 byte boundary.
BSR for powers of 2.
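You can use BSR to find the highest power of 2 that goes into a variable. BSR returns the bit index of the highest set bit, so for a nonzero value x it gives the largest n with 2^n <= x. If the source is 0, BSR sets the zero flag and the destination is undefined.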
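For reference, here is what BSR computes, written as a plain C loop. This is only a sketch of the result, not a replacement for the instruction (the loop is far slower), and the name `highest_bit` is made up for this example.

```c
#include <stdint.h>

/* Index of the highest set bit of x, i.e. floor(log2(x)), which is the
   value BSR produces. Call it with x != 0 only: BSR leaves its
   destination undefined for 0, while this loop happens to return 0. */
int highest_bit(uint32_t x)
{
    int index = 0;
    while (x >>= 1)
        index++;
    return index;
}
```

So a value of 1000 gives index 9, because 2^9 = 512 is the highest power of 2 that fits.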
XORing a register with itself to zero it.
This is an oldie, but I am including it anyway. It also has a side benefit of clearing dependencies on the register. That is why sometimes you will see people use XOR in that fashion before doing a partial register access. I prefer using MOVZX because the XOR approach is trickier to get right (see the MOVZX tip above). On the P4 they also added support for PXOR to break dependencies in that fashion. I think the P3 does the same thing.
Use XOR and DIV.
If you know your data can be unsigned for a DIVISION, use XOR EDX, EDX, then DIV. It's faster than CDQ and IDIV.
Try to avoid obvious dependencies.
If you modify a register and then compare it to some value on the very next line, instead try and put some other register modification in between. Dependencies are any time you modify a register and then read it or write it shortly afterwards.
```
   inc edi 
   inc eax 
   cmp eax,1    ; this line has a dependency with the previous line, so it will stall.
   jz  fred 

;shuffling the instructions around we can help break up dependencies.
   inc eax 
   inc edi 
   cmp eax,1 
   jz  fred 
```
Instructions to avoid on P4.
On P4's try to avoid the following instructions, adc, sbb, rotate instructions, shift instructions, inc, dec, lea, and any instruction taking more than 4 uops. How do you know the processor running the code is a P4? CPUID.
Using lookup tables.
On the P4 sometimes you can get around the long latency instructions that I listed previously by doing lookup tables. Thankfully on P4's they come with really fast memory. So having to do a lookup table doesn't hurt performance as much if it isn't in the cache.

Use pointers instead of calculating indexes.

A lot of times in loops in C there will be multiplications by non-powers of 2 numbers. You can easily get around this by adding instead. Here is an example that uses a structure.

typedef struct fred 
{ 
   int fred; 
   char bif; 
} freddy_type; 

freddy_type charmin[80];

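The size of freddy_type is 5 bytes if the structure is packed to 1-byte alignment (with default packing the compiler pads it to 8). If you access the packed entries in a loop by index, the compiler will generate code that multiplies by 5 for each array access!!!! (Ewwwwwwwwwwwww). So how do we do it properly?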

for ( int t = 0; t < 80; t++) 
{ 
   charmin[t].fred = rand(); // the compiler multiplies by 5 to get the offset, EWWWWWWWW!
   charmin[t].bif = (char)(rand() % 256); 
} 

; in assembler we start with an offset of 0, that points to the first data item.
; And then we add 5 to it each loop iteration to avoid the MUL.

   mov   esi,offset charmin 
   mov   ecx,80 
fred_loop: 
   ;... perform operations on the FRED and BIF elements in freddy_type 
   add   esi,5                     ;make it point to the next structure entry. 
   dec   ecx 
   jnz   fred_loop

The MUL removal applies to loops as well. I have seen people do multiplies in loops as part of incrementing the variable or for terminating condition. So try doing addition instead.
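The same idea can be written in C by stepping a pointer through the array instead of indexing it. This is a minimal sketch using a local copy of the structure from the example. The function names are invented here, and a modern optimizing compiler may well do this transformation on its own, so check the generated code and time it.

```c
#include <stddef.h>

typedef struct {
    int  fred;
    char bif;
} freddy_type;

/* Indexed version: each charmin[t] implies t * sizeof(freddy_type). */
long sum_indexed(const freddy_type *charmin, size_t count)
{
    long total = 0;
    for (size_t t = 0; t < count; t++)
        total += charmin[t].fred;
    return total;
}

/* Pointer version: the pointer advances by one entry per pass,
   which is an ADD of the structure size and never a multiply. */
long sum_pointer(const freddy_type *charmin, size_t count)
{
    long total = 0;
    const freddy_type *end = charmin + count;
    for (const freddy_type *p = charmin; p != end; p++)
        total += p->fred;
    return total;
}
```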

Conform to default branch predictions.
Try to set up your code such that backward conditional jumps are usually taken, and forward conditional jumps are almost never taken. That has to do with branch prediction. The static branch predictor uses that simple rule to guess if a conditional jump is taken or not. So have a loop that has a backward conditional jump at the end. And then have any special exit conditions from that same loop use a forward jump that is only taken on a condition that doesn't often occur.
Eliminate branches
Eliminate branches where possible. This might seem obvious, but I have seen some people use too many branches in their assembler code. Keep it simple. Use as few branches as possible.
Using CMOVcc to remove branches
I have yet to see the CMOVcc instructions actually be faster than a conditional jump. So I recommend using conditional jumps over CMOVcc. It might be faster in the case where your jumps aren't easily guessable by the branch prediction logic. So if that is the case with you, benchmark it and see.
Local vs. Global variables
Use local variables for a procedure over using a global variable. If you use local variables you'll get fewer cache misses.
Address Calculation
Compute address calculations before you need them. Let's say you have to do some funky stuff to get to a particular address. Such as multiplying by 20. You can pre-compute that before you get to the point in the code where you need it.

Smaller registers

Sometimes using smaller registers will give you a speed up. I did this on the radix40 code. If you change the TEST in the code below to use EDX instead of DL, it runs slightly slower.

        movzx           edx,byte ptr [esi]              ;get the data from the ascii array
        test            dl,ILLEGAL                      ;bit 7 set?  if so do ILLEGAL handling
        jnz             skip_illegal

Instruction Length
Try and keep your instructions to 8 bytes or less.
Use registers to pass parameters
Try passing parameters in registers instead of on the stack where possible. If you have 3 variables that you have to push onto the stack as parameters, that is at least 6 memory reads and 3 memory writes. You have to read each variable from memory into a CPU register and then push it on the stack. That is 3 memory reads right there. Then the 3 pushes onto the stack make 3 writes. And since you would not push parameters you never use, figure on at least 3 more reads (maybe more) when the called routine reads the pushed data back from the stack.
Don't pass big data on the stack
Don't pass 64-bit data or 128-bit data ( or bigger) on the stack. Instead pass a pointer to the data.

Intermediate

Adding to memory faster than adding memory to a register
This has to do with the number of micro-ops the instruction takes. Give preference to doing an add with memory over adding memory to a register.
```
        add     eax,[edi]       ;don't do this if possible
        add     [edi],eax       ;This is preferred
```
Instruction selection
Try and pick instructions with the fewest micro-ops and shortest latencies.
Handling an unaligned byte data stream that needs to be dword aligned
Parsing a byte array a dword at a time will cost you performance if the buffer is not aligned to a 4 byte boundary. You can get around this by handling the first X bytes (0 to 3) one at a time until you reach a 4 byte boundary, and then switching to dword accesses.
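Here is a minimal C sketch of that head/body/tail structure, using a simple byte sum as the work being done. The function name `sum_bytes` and the choice of summing are assumptions made for illustration. The point is only the three phases: unaligned head, aligned dwords, leftover tail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum every byte in buf, reading a dword at a time once aligned. */
uint32_t sum_bytes(const unsigned char *buf, size_t len)
{
    uint32_t total = 0;

    /* Head: 0 to 3 single bytes until the pointer is 4-byte aligned. */
    while (len > 0 && ((uintptr_t)buf & 3) != 0) {
        total += *buf++;
        len--;
    }

    /* Body: aligned dword reads, 4 bytes per pass. */
    while (len >= 4) {
        uint32_t dword;
        memcpy(&dword, buf, sizeof dword);
        total += (dword & 0xFF) + ((dword >> 8) & 0xFF)
               + ((dword >> 16) & 0xFF) + (dword >> 24);
        buf += 4;
        len -= 4;
    }

    /* Tail: whatever is left after the last full dword. */
    while (len > 0) {
        total += *buf++;
        len--;
    }
    return total;
}
```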

Using CMOVcc to reset an infinite loop pointer

If you are making multiple passes through an array, and want to reset it to the beginning when you have reached the end of the array, you can use CMOVcc.

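        mov edx,MAX_COUNT   ; CMOVcc cannot take an immediate operand, so keep
                            ; MAX_COUNT in a register (load it once, outside the loop)
        dec ecx             ; decrement index into array
        cmovz ecx,edx       ; if we are at the beginning, then reset the index
                            ; to MAX_COUNT (the end).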

Multiplying by 0.5 by doing a subtraction
This probably won't work for everything, but to multiply by 0.5 (or divide by 2.0) in real4 floating point you can just subtract 1 from the exponent. It won't work with 0.0. For real8, the subtract value is 00100000h, applied to the upper dword. (donkey posted this)
```
.data
        somefp real4    2.5
.code
        sub dword ptr [somefp],00800000h        ;divide real4 by 2.
```
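The same trick can be sketched in C to check the arithmetic. This assumes the usual IEEE 754 single-precision layout, where the exponent field starts at bit 23, and the function name is made up for this example. Like the assembler version, it is only valid for ordinary nonzero values (not 0.0, denormals, infinities or NaNs).

```c
#include <stdint.h>
#include <string.h>

/* Halve a float by subtracting 1 from its exponent field.
   Assumes IEEE 754 single precision; valid only for normal,
   nonzero values whose result does not underflow. */
float half_by_exponent(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);   /* reinterpret the float as a dword */
    bits -= 0x00800000u;                  /* exponent field starts at bit 23 */
    memcpy(&value, &bits, sizeof value);
    return value;
}
```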
Self Modifying Code
The P4 optimization manual recommends avoiding self-modifying code. I have seen cases where it can run faster. But as always you need to time it to verify in your case that it is faster.
MMX, SSE, SSE2
Most compilers don't generate good code for MMX, SSE and SSE2. GCC and the Intel compiler have gotten a lot better at it. But hand tooled assembler is still a big win in this area.
Using EMMS.
EMMS tends to be a really slow instruction on Intel processors. On AMD it is faster. Generally I don't do it on a per routine basis, because it is so slow. I very rarely use a lot of floating point in a program that I already have a lot of MMX in (and vice versa). So I usually wait to do the EMMS before doing any floating point. If you have a lot of floating point and very little MMX, then do the EMMS at the end of all the MMX routines you call (if you call any). But adding it in every routine that does MMX just makes the code run slow.
Converting to MMX,SSE, or SSE2
Can your code be converted to MMX, SSE, or SSE2? If so you can get a big speed up by doing stuff in parallel.

Prefetching data.

This is underutilized a lot. If you are processing a huge array (256KB and up), using the "prefetch" instruction on P3 and up processors can speed up your code anywhere from 10-30%. If you don't use it right, you can actually get a degradation in performance. Unrolling works well with this, because I unroll to the number of bytes fetched with this instruction. On a P3 it is 32, but on a P4 it is 128. That means you can easily unroll your loops to handle 128 bytes at a time on a P4 and get the benefit from both unrolling and prefetching. Unrolling to 128 bytes will not always give you the best speed up, so try different variations.

UNROLL_NUM_BYTES equ    4                       ; # of bytes handled in one
                                                ; iteration of the loop.
UNROLL_AMT       equ    128/UNROLL_NUM_BYTES    ; We want to unroll the loop such
                                                ; that we handle 128 bytes per loop.

        mov     ecx,1024
looper:
offset2 = 0
        prefetchnta [edi+128]   ; prefetch the next 128-byte block before we need it.
REPEAT UNROLL_AMT
        add     eax,[edi+offset2]
offset2 = offset2 + UNROLL_NUM_BYTES
ENDM
        add     edi,UNROLL_AMT * UNROLL_NUM_BYTES ;we dealt with 32*4 = 128 bytes.
        sub     ecx,UNROLL_AMT        ; subtract from loop counter the # of loops we unrolled.
        jnz     looper

Cache Blocking
Let's say you have to call multiple procedures on a big array in memory. It is better to break it up into blocks that fit into the cache to reduce cache misses. For example, if you were doing 3D code, the first procedure might translate your coordinates, the second might scale and the third might rotate. So instead of each procedure going through the whole huge array in one fell swoop, you break off a "chunk" of the data that fits into the cache, call all 3 procedures on it, and then go to the next chunk and repeat.
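A minimal C sketch of the chunking structure follows. Everything in it is a stand-in: the three passes are trivial placeholders for translate, scale and rotate (the third simply negates), and CHUNK is a hypothetical block size that you would tune so one block fits in your cache.

```c
#include <stddef.h>

#define CHUNK 1024   /* elements per block: hypothetical, tune for your cache */

/* Three placeholder passes standing in for translate, scale, rotate. */
void translate(float *v, size_t n) { for (size_t i = 0; i < n; i++) v[i] += 1.0f; }
void scale(float *v, size_t n)     { for (size_t i = 0; i < n; i++) v[i] *= 2.0f; }
void negate(float *v, size_t n)    { for (size_t i = 0; i < n; i++) v[i] = -v[i]; }

/* Run all three passes over one cache-sized chunk before moving on,
   instead of running each pass over the whole array in turn. */
void process_blocked(float *data, size_t count)
{
    for (size_t start = 0; start < count; start += CHUNK) {
        size_t n = count - start;
        if (n > CHUNK)
            n = CHUNK;
        translate(data + start, n);
        scale(data + start, n);
        negate(data + start, n);
    }
}
```

The result is the same as running each pass over the full array. Only the order of memory accesses changes, so each chunk is still in the cache when the second and third passes touch it.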
TLB Priming
The TLB is the Translation Lookaside Buffer. It is a cache that speeds up the translation of a virtual memory address to a physical memory address by providing fast access to page table entries. If the entry for a page is not in the TLB, you take a TLB miss, which slows the code down. The trick is to pre-read a data byte from the next page before you have to read it. I will show an example later on in another one of the tips.
Intermix your code to break dependencies.
In C code the compiler treats different chunks of code as separate. When you get to the assembler level you can intermix them to break dependencies.

Parallelization.

Most Compilers don't take advantage of the fact you have 2 pipelines for ALU stuff, which is the majority of what people use. On the P4 you have it even sweeter. You can execute 4 ALU instructions in 1 cycle if you do it right. If you break things up into doing stuff in parallel it also helps break up dependencies. So you kill two birds with one stone. Assume this piece is in a loop.

looper:
        mov     eax,[esi]
        xor     eax,0E5h        ;dependency with the line above it.
        add     [edi],eax       ;dependency with the line above it.
        add     esi,4
        add     edi,4
        dec     ecx
        jnz     looper

;So how do we 'parallelize' it and reduce dependencies?
looper:
        mov     eax,[esi]
        mov     ebx,[esi+4]
        xor     eax,0E5h
        xor     ebx,0E5h
        add     [edi],eax
        add     [edi+4],ebx
        add     esi,8
        add     edi,8
        sub     ecx,2
        jnz     looper

Avoiding memory accesses

Restructure the code to avoid memory accesses (or other I/O). One method is to accumulate a value in a register before writing it to memory. Here is an example of that below. In this example we are adding 3 bytes from the source array to each entry of the destination array, which is an array of dwords. The destination array is zeroed.

        mov     ecx,AMT_TO_LOOP
looper:
        movzx   eax,byte ptr [esi]
        add     [edi],eax
        movzx   eax,byte ptr [esi+1]
        add     [edi],eax
        movzx   eax,byte ptr [esi+2]    ;third of the 3 source bytes
        add     [edi],eax
        add     edi,4
        add     esi,3
        dec     ecx
        jnz     looper

We can accumulate the result in a register, and then only do one write to memory.

        mov     ecx,AMT_TO_LOOP
looper:
        xor     edx,edx                 ;zero out register to accumulate the result in.
        movzx   eax,byte ptr [esi]
        add     edx,eax
        movzx   eax,byte ptr [esi+1]
        add     edx,eax
        movzx   eax,byte ptr [esi+2]    ;third of the 3 source bytes
        add     edx,eax
        add     esi,3
        mov     [edi],edx
        add     edi,4
        dec     ecx
        jnz     looper

When to convert a call to a jump
If the last statement in a routine is a call, consider converting it to a jump to get rid of one call/ret pair.
Using arrays for data structures
(This is non-assembler related, but it's a great one). You can use an array for data structures such as trees and linked lists. By using an array the memory ends up being contiguous and you get a speed up due to fewer cache misses.
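As a small sketch of what that can look like, here is a singly linked list stored in a fixed array, with array indexes used in place of pointers. All the names and the MAX_NODES limit are made up for this example. A real version would also need node removal and a free list.

```c
#define MAX_NODES 8      /* hypothetical capacity */
#define NIL       (-1)   /* index that marks the end of the list */

typedef struct {
    int value;
    int next;            /* index of the next node, or NIL */
} node;

typedef struct {
    node nodes[MAX_NODES];   /* all nodes live in one contiguous block */
    int  head;
    int  used;
} list;

void list_init(list *l)
{
    l->head = NIL;
    l->used = 0;
}

/* Push a value on the front. Returns 0 on success, -1 if the array is full. */
int list_push(list *l, int value)
{
    if (l->used == MAX_NODES)
        return -1;
    l->nodes[l->used].value = value;
    l->nodes[l->used].next  = l->head;
    l->head = l->used++;
    return 0;
}

/* Walk the list by following indexes. */
long list_sum(const list *l)
{
    long total = 0;
    for (int i = l->head; i != NIL; i = l->nodes[i].next)
        total += l->nodes[i].value;
    return total;
}
```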

Advanced

Avoid prefixes
Try to avoid prefixes ( prefixes get generated for a number of things including segment overrides, branch hints, operand-size override, address-size override, LOCKs, and REPs). Prefixes make your instructions longer.

Grouping reads/writes in code

If there is a bunch of alternating between read and write transactions on the bus, look at grouping and doing more reads at a time and more writes at a time. Here is what we are trying to avoid:

        mov eax,[esi]
        mov [edi],eax
        mov eax,[esi+4]
        mov [edi+4],eax
        mov eax,[esi+8]
        mov [edi+8],eax
        mov eax,[esi+12]
        mov [edi+12],eax

;Grouping the reads and writes together this gets converted to
        mov eax,[esi]
        mov ebx,[esi+4]
        mov ecx,[esi+8]
        mov edx,[esi+12]
        mov [edi],eax
        mov [edi+4],ebx
        mov [edi+8],ecx
        mov [edi+12],edx

Making use of execution units to make your code run faster
Choose instructions that execute on different execution units. If you do this properly the time to execute the code will be the throughput time and not the latency time. For most instructions the throughput time is less.
Interleaving 2 loops out of sync
You can unroll a loop twice, and instead of running the two copies one after the other, you can run them out of sync. Why is this useful? 2 reasons. First, sometimes you have instructions that have to use a certain register and have a long latency, such as MUL or DIV. Two MUL instructions in a row create a dependency on EDX:EAX. Second, sometimes some instructions just have really long latencies. So you want to place a number of instructions from the other loop after one of them, to fill the time until it returns its result. A lot of MMX, SSE, and SSE2 instructions on the P4 fall into that category. Here is an example loop:

        A1      ; instruction 1 loop 1
        D2      ; instruction 4 loop 2
        B1      ; instruction 2 loop 1
        A2      ; instruction 1 loop 2
        C1      ; instruction 3 loop 1
        B2      ; instruction 2 loop 2
        D1      ; instruction 4 loop 1
        C2      ; instruction 3 loop 2

Different tricks you can do with masks created by comparison instructions using MMX/SSE/SSE2.

With MMX and SSE and SSE2 you can generate masks when doing comparisons. This can be helpful in some cases when looking for a pattern in a file, such as a line feed. So you can use it to search for patterns, not just math operations.

You can use MMX, SSE, and SSE2 masks that get generated by a compare instruction to control doing math on only part of a MMX or SSE register. The following piece of code only adds a 9 to the dword parts of an MMX register if it has a 5 in it.

; if (fredvariable == 5)
;       fredvariable += 9;
    movq     mm5,[two_fives]        ;mm5 has 2 DWORD 5's in it. 
    movq     mm6,[two_nines]        ;mm6 has 2 DWORD 9's in it. 
    movq     mm0,[array_in_memory]  ;get value 
    movq     mm1,mm0                ;get backup copy 
    pcmpeqd  mm1,mm5                ;mm1 now has an FFFFFFFF in each dword location
                                    ; in MM1 with a 5 all other locations have 0. 
    pand     mm1,mm6                ;MM1 now has a 9 in each dword location where
                                    ; MM0 has a 5, and 0 everywhere else.
    paddd    mm0,mm1                ;add 9 ONLY to the locations in MM0 that
                                    ; have a 5 in them.
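The mask idea is easier to follow on a single value. Here is a scalar C sketch of the same compare, AND, add sequence. The function name is made up for this example, and the MMX code simply does this to two dwords at once.

```c
#include <stdint.h>

/* Branch-free form of: if (x == 5) x += 9; */
uint32_t add9_if_5(uint32_t x)
{
    /* Like PCMPEQD: all ones when equal, all zeros otherwise. */
    uint32_t mask = 0u - (uint32_t)(x == 5);

    /* Like PAND then PADDD: add 9 where the mask is set, else add 0. */
    return x + (9u & mask);
}
```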

PSHUFD and PSHUFW.
On the P4 the MMX, SSE and SSE2 move instructions are slow. You can get around this by using "pshufd" for XMM registers (it needs SSE2) and "pshufw" for MMX registers. It is 2 cycles faster. There is one caveat. It has to do with which pipeline the opcode goes down. So without getting too technical, sometimes it is faster to use the slower "MOVDQA" than to replace it with a "PSHUFD". So time your code.
```
        pshufd  xmm0,[edi],0E4h     ;copy 16 bytes at location EDI into XMM0.
                                    ; The 0E4h makes it a straight copy.
        pshufw  mm0,[edi],0E4h      ;copy 8 bytes at location EDI into MM0.
                                    ; The 0E4h makes it a straight copy.
```
Write directly to memory - bypass the cache.
Another optimization dealing with memory. If you have to write to a lot of memory (256KB and up) it is faster to write directly to memory bypassing the cache. If you have a P3 you can use "movntq" or "movntps". The first does an 8 byte write, the second a 16-byte. The 16-byte write needs to be 16-byte aligned. On the P4 you also get "movntdq", which does 16-bytes also, but needs to be 16-byte aligned. This trick applies to both memory fills and memory copies. Both do a write operation. Here is some sample code. I personally would do 8 XMM registers in parallel to help break up some of the latencies for the P4 MOVDQA instruction. However to help understanding, I did not do that.
```
        mov     ecx,16384           ;write 16384 16-byte values, 16384*16 = 256KB.
                                    ; So we are copying a 256KB array
        mov     esi,offset src_arr  ;pointer to the source array which has to be
                                    ; 16-byte aligned or you will get an exception.
        mov     edi,offset dst_arr  ;pointer to the destination array which has to be
                                    ; 16-byte aligned or you will get an exception.
looper: 
        movdqa  xmm0,[esi]          ;SSE2, so P4 and up (use MOVAPS on a P3)
        movntps [edi],xmm0          ;Works on P3 and up
        add     esi,16
        add     edi,16
        dec     ecx
        jnz     looper
        
```
Handle 2 cases per loop for MMX/SSE/SSE2.
On the P4 the latencies are usually so long on the MMX, SSE and SSE2 instructions that I always handle 2 cases per loop, or read a loop in advance. Or more than 2 cases if I have enough registers. All the various MOVE instructions (including MOVD) on the P4 are slow. So adding 2 arrays of 32-bit numbers together is going to be slower on the P4 than on the P3. A faster way would be to do two per loop, where you pre-read the loop's initial values into MM0 and MM1 before the FRED label. You just need special handling if you have an odd number of array elements: check for that at the end, and if so add code for the one extra dword. Here is the code that does not read a value in advance. I think converting it to read a value in advance is easy, so I am not posting both. This is how you would add two arrays together on a P4 without using ADC, which is a slow instruction there.
```
    pxor    mm7,mm7     ; the previous loops carry stays in here. 
fred:    
    movd    mm0,[esi]   ; esi points to src1 
    movd    mm1,[edi]   ; edi points to src2, also to be used as the destination. 
    paddq   mm0,mm1     ; add both values together 
    paddq   mm0,mm7     ; add in carry from last add. 
    movd    [edi],mm0   ; save value to memory 
    movq    mm7,mm0 
    psrlq   mm7,32      ; shift the carry over to bit 0. 
    add     esi,4       ; MOVD handles 4 bytes per pass 
    add     edi,4 
    sub     ecx,1 
    jnz     fred 
    movd    [edi],mm7   ; save carry 
```
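The carry handling in that loop can be sketched in C with 64-bit arithmetic, which may make it easier to follow. The function name `add_limbs` is invented for this example. It treats both arrays as one big number stored least significant dword first, which is what the MMX loop does.

```c
#include <stddef.h>
#include <stdint.h>

/* dst += src over `count` 32-bit dwords, least significant first.
   Returns the carry out of the top dword (0 or 1). */
uint32_t add_limbs(uint32_t *dst, const uint32_t *src, size_t count)
{
    uint64_t carry = 0;                  /* plays the role of MM7 */
    for (size_t i = 0; i < count; i++) {
        /* 64-bit add of two zero-extended dwords plus the carry (PADDQ). */
        uint64_t sum = (uint64_t)dst[i] + src[i] + carry;
        dst[i] = (uint32_t)sum;          /* low 32 bits go back to memory */
        carry  = sum >> 32;              /* carry is bit 32 (PSRLQ by 32) */
    }
    return (uint32_t)carry;
}
```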
Pre-reading MMX or XMM register to get around long latency
Pre-reading an SSE2 register before you need it will give a speed up. That is because MOVDQA takes 6 cycles on a P4. That is really slow. So because it has such a long latency, I want to read it in advance of where I use it to make sure it doesn't stall. Here is an example.

        movdqa  xmm1,[edi+16]   ;read in XMM1 before we use it, takes 6 cycles on P4,
                                ; not including the time to get it from cache.
        por     xmm5,xmm0       ;do an OR with XMM0 which was previously read.
                                ; Takes 2 cycles on the P4.
        pand    xmm6,xmm0       ;do an AND with XMM0 which was previously read.
                                ; Takes 2 cycles on the P4.
        movdqa  xmm2,[edi+32]   ;read in XMM2 before we use it, takes 6 cycles on P4,
                                ; not including the time to get it from cache.
        por     xmm5,xmm1       ;do an OR with XMM1 which was previously read.
                                ; Takes 2 cycles on the P4.
        pand    xmm6,xmm1       ;do an AND with XMM1 which was previously read.
                                ; Takes 2 cycles on the P4.

Accumulating a result in a register or registers to avoid doing a slow instruction

I did this to speed up a compare/read loop written in SSE2. The slow instruction was PMOVMSKB. So instead of executing it every loop, I accumulated a result in a register. And then for every 4KB of memory read, I would do one PMOVMSKB. This gave a good speed up. The example below also demonstrates using PREFETCH and TLB Priming. There are 2 loops in the code below. The inner loop is unrolled to 128 bytes (the number of bytes fetched by PREFETCH on a P4). The outer loop is unrolled to 4KB, so that it can do TLB Priming. If you are using a system that doesn't use a 4KB page size, you'd have to modify the code appropriately. On the system I tested this on (a Dell Server) with 6.4 GB/s of max memory bandwidth, I was able to do a read and a compare at 5.55 GB/s (in a non-Windows environment; under Windows it will run slower). I left out the code at label "compare_failed" for 2 reasons. 1) The cut/paste of code is already big. 2) It doesn't demonstrate any techniques I want to show. The code at "compare_failed" simply does a REP SCASD to find the failing address after PCMPEQD finds it to the nearest 4KB block. This one has a HUGE code example, so I put it last in case you fall asleep reading it ;)


read_compare_pattern_sse2 proc near

                mov         edi,[start_addr]        ;Starting Address
                mov         ecx,[stop_addr]         ;Last addr to NOT test.
                mov         ebx,0FFFFFFFFh          ;AND mask
                movd        xmm6,ebx                ;AND mask
                pshufd      xmm6,xmm6,00000000b     ;AND mask
                movdqa      xmm0,[edi]              ;Get first 16 bytes
                mov         eax,[pattern]           ;EAX holds pattern
                pxor        xmm5,xmm5               ;OR mask
                movd        xmm7,eax                ;Copy EAX to XMM7
                pshufd      xmm7,xmm7,00000000b     ;Blast to all DWORDS
outer_loop:
                mov         ebx,32                  ;32 128-byte blocks (4KB)
                mov         esi,edi                 ;save start of block

if DO_TLB_PRIMING
                mov         eax,[edi+4096]          ;TLB priming
endif ; if DO_TLB_PRIMING

fred_loop:
                movdqa      xmm1,[edi+16]           ;read 16 bytes
                por         xmm5,xmm0               ;OR into mask
                pand        xmm6,xmm0               ;AND into mask

                movdqa      xmm2,[edi+32]           ;read 16 bytes
                por         xmm5,xmm1               ;OR into mask
                pand        xmm6,xmm1               ;AND into mask

                movdqa      xmm3,[edi+48]           ;read 16 bytes
                por         xmm5,xmm2               ;OR into mask
                pand        xmm6,xmm2               ;AND into mask

                movdqa      xmm0,[edi+64]           ;read 16 bytes
                por         xmm5,xmm3               ;OR into mask
                pand        xmm6,xmm3               ;AND into mask

                movdqa      xmm1,[edi+80]           ;read 16 bytes
                por         xmm5,xmm0               ;OR into mask
                pand        xmm6,xmm0               ;AND into mask

                movdqa      xmm2,[edi+96]           ;read 16 bytes
                por         xmm5,xmm1               ;OR into mask
                pand        xmm6,xmm1               ;AND into mask

                movdqa      xmm3,[edi+112]          ;read 16 bytes
                por         xmm5,xmm2               ;OR into mask
                pand        xmm6,xmm2               ;AND into mask

                por         xmm5,xmm3               ;OR into mask
                prefetchnta [edi+928]               ;Prefetch 928 ahead
                pand        xmm6,xmm3               ;AND into mask

                add         edi,128                 ;Go next 128byteblock
                cmp         edi,ecx                 ;At end?
                jae         do_compare              ;Yes, go compare

                movdqa      xmm0,[edi]              ;read 16 bytes

                sub         ebx,1                   ;Decr inner loop count
                jnz         fred_loop

do_compare:
                pcmpeqd     xmm5,xmm7               ;Equal?
                pmovmskb    eax,xmm5                ;Grab high bits in EAX
                cmp         eax,0FFFFh              ;all set?
                jne         compare_failed          ;No, exit failure

                mov         edx,0FFFFFFFFh          ;AND mask
                pxor        xmm5,xmm5
                pcmpeqd     xmm6,xmm7               ;Equal?

                pmovmskb    eax,xmm6                ;Grab high bits in EAX
                cmp         eax,0FFFFh              ;All Set?
                jne         compare_failed          ;No, exit failure

                movd        xmm6,edx                ;AND mask
                pshufd      xmm6,xmm6,00000000b     ;AND mask

                cmp         edi,ecx                 ;We at end of range
                jb          outer_loop              ;No, loop back up

                jmp         compare_passed          ;Done!!! Success!!!
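
The listing keeps two accumulators, an OR mask in XMM5 and an AND mask in XMM6, and the reason for needing both is easy to miss. A stray set bit shows up in the OR mask, while a stray cleared bit only shows up in the AND mask. The following is a small Python model of a single 32-bit lane, offered as a sketch to illustrate the logic rather than as part of the original routine; the function name `block_matches` is made up for this illustration.

```python
def block_matches(dwords, pattern):
    """Model one 32-bit lane of the OR/AND accumulation (illustrative only)."""
    or_mask = 0               # plays the role of PXOR xmm5,xmm5
    and_mask = 0xFFFFFFFF     # plays the role of loading XMM6 with all ones
    for value in dwords:
        or_mask |= value      # POR: picks up any bit that is set but should be clear
        and_mask &= value     # PAND: loses any bit that is clear but should be set
    # One comparison per block instead of one per read.
    return or_mask == pattern and and_mask == pattern
```

With a pattern of 55555555h, a block containing 55555554h still produces a correct OR mask, and only the AND mask reveals the error. That is why the listing does two PCMPEQD/PMOVMSKB checks at do_compare.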

Prefetch distance and location within the loop
You will notice in the previous example that I prefetch 928 bytes ahead instead of 128, when 128 is the number of bytes fetched on a P4. Why is that? Well, Intel recommends prefetching 128 bytes (2 cache lines) ahead at the start of your loop. I found both recommendations (prefetching at the beginning of the loop, and prefetching 128 bytes ahead) to be wrong. I don't prefetch at the beginning of the loop, nor do I prefetch 128 bytes ahead.

Why? When I was playing with the code I found that I could make it run faster by moving the PREFETCH instruction around in the loop and changing the offset for how far ahead it prefetches. So, being the geek I am, I wrote code to try all combinations of the location of the prefetch instruction in the loop and the offset at which to begin prefetching. The code takes an assembler file, moves the "prefetch" instruction around inside a loop, and also modifies the prefetch offset. Then a batch file assembles the just-modified code and runs a benchmark. I ran the benchmark over several hours to try the different combinations (I started at a prefetch distance of 32 and went up to 1024, in increments of 32).

On the system I wrote the code on, prefetching 928 bytes ahead instead of 128 was the fastest, and prefetching almost at the end of the loop was fastest (the prefetchnta instruction is about 8 lines above the do_compare: label).
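
The search described above can be sketched in a few lines. This is not the original tool, which is not shown here; it is a minimal Python sketch that assumes the loop body is available as a list of source lines with the prefetch instruction removed. The function name `prefetch_variants` and the `template` string are hypothetical. Writing each variant to a file, assembling it, and timing it are left out, since they depend on your assembler and benchmark.

```python
def prefetch_variants(loop_lines, distances=range(32, 1025, 32),
                      template="    prefetchnta [edi+{}]"):
    """Yield (position, distance, lines) for every placement/offset combination.

    loop_lines is the loop body WITHOUT any prefetch instruction in it.
    distances defaults to 32..1024 in steps of 32, as in the text.
    """
    for position in range(len(loop_lines) + 1):
        for distance in distances:
            variant = list(loop_lines)                    # copy, leave the input alone
            variant.insert(position, template.format(distance))
            yield position, distance, variant
```

A driver would loop over the variants, time each one several times, and keep the fastest (position, distance) pair. The winner is specific to the machine it was measured on, so rerun the search on each target system.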

