The most important thing to remember is to TIME your code. Trying
different tricks might or might not speed up your code. So it is very
important to time your code to see if you do get a speedup as you try
each trick.
Beginner
Freeing up all 8 CPU registers for use in your code. This also demonstrates how to save a GP register in an MMX register
push ebx
push esi
push edi
push ebp ; has to be done before changing ESP
;load up ESI and EDI and your other registers with the vales passed
;in on the stack. This has to be done before freeing up ESP.
movd mm0,esp ; no pushing/popping past this point.
xor ebx,ebx ; So how do you save? A variable.
mov esp,5
inner_loop:
mov [eax_save],eax ; eax_save is a global variable. Can't be
movzx eax,word ptr [fred_2] ; local because that requires EBP to point to it.
add ebx,eax
mov eax,[eax_save] ; restore eax
movd esp,mm0 ; has to be done before you do the POPs
pop ebp
pop edi
pop esi
pop ebx
ret
Maximize register usage.
Most high level compilers generate a lot of memory accesses to
variables. I usually get around that by trying to keep most of them in
registers. Having all 8 cpu registers free can really come in handy.
Complex instructions
Avoid complex instructions ( lods, stos, movs, cmps, scas, loop, xadd,
enter, leave). Complex instructions are instructions that do multiple
things. For instance stosb writes a byte to memory and also increments
EDI. They stopped making these fast with the original Pentium because
they were trying to make it more RISC like. Using REP and the string
instructions is still fast. That is the only exception to the case.
Don't use INC/DEC on P4
On the P4 use ADD/SUB in place of INC/DEC. Generally it is faster. ADD/SUB runs in 0.5 cycles. INC/DEC takes 1 cycle.
Rotating
Avoid rotate by a register or rotate by an immediate value of anything but a 1.
Eliminate unnecessary compare instructions
Eliminate unnecessary compare instructions by doing the appropriate
conditional jump instruction based on the flags that are already set
from a previous arithmetic instruction.
dec ecx
cmp ecx,0
jnz loop_again
;gets changed to
dec ecx
jnz loop_again
LEA is still really cool, except for on the P4 it tends to be slow.
You can perform multiple math operations all in one instruction and
it does not affect the flags register so you can put in in between one
register being modified and a flags comparison jump on the next line.
top_of_loop:
dec eax
lea edx,[edx*4+3] ; multiply by 4 and add 3. Does not affect flags
jnz top_of_loop ; so the next instruction doesn't get hosed.
ADC and SBB.
Most compilers don't really make good use of ADC and SBB. You can get
good speeds ups with that. Adding 2 64-bit numbers together, or adding
big numbers together. Keep in mind that on the P4 ADC and SBB are slow.
As a work around you can use "addq", and use MMX to do this. So the
second optimization suggestion for this is to use MMX to do the adding
or subtracting. You just have to have a processor that supports MMX.
add eax,[fred]
adc edx,[fred+4]
; the above 2 statements are the same as the below 3 statements
movd mm0,[fred] ; Get 32-bit value in MM0
movd mm1,[fred+4] ; Get 32-bit value in MM1
paddq mm0,mm1 ; This is an unoptimized way to do it. You would
; really pre-read MM0 and MM1 a loop in advance.
; I did it this way for ease of understanding.
ROL, ROR, RCL, and RCR and BSWAP.
It is a cool trick to switch from Big Endian to Little Endian using
BSWAP. Also you can use it for temporary storage of a 16-bit or 8-bit
value in the upper half of the register. Likewise you can use ROL and
ROR for storing 8-bit and 16-bit values. It's a way to get more
"registers". If all you are dealing with are 16-bit values, you can
turn your 8 32-bit registers into 16 16-bit registers. Which gives you
a lot more registers to use. RCL and RCR can also easily be used for
counting the number of bits that are set in a register. Keep in mind
that ROL, ROR, RCL, RCR and BSWAP are all slow on the P4. The rotate
instructions are about twice as fast as BSWAP. So if you have to use
one or the other on the P4 use the rotate ones.
xor edx,edx ; set both 16-bit registers to 0
mov dx,234 ; set the first 16-bit register to 234
bswap edx ; swap it so the second one is ready
mov dx,345 ; set the second 16-bit register to 345
bswap edx ; swap to the first one
add dx,5 ; add 5 to the first one
bswap edx ; swap to the second one
add dx,7 ; add 7 to it
String instructions.
Most compilers don't make good use of the string instructions ( scas,
cmps, stos, movs, and lods). So checking to see if that is faster than
some library routine can be a win. For instance I was really surprised
when I looked at strlen() in VC++. In the radix40 code it ran in 416
cycles for a 100 byte string!!! I thought that was absurdly slow.
Multiply to divide.
If you have a full 32-bit number and you need to divide, you can simply
do a multiply and take the top 32-bit half as the result. This is
faster because multiplication is faster than division. ( thanks to
pdixon for the tip).
Dividing by a constant.
There's some nice information now how to divide by a constant in Agner
Fog's pentopt.pdf document. I wrote a program that you pass in the
number you want to divide by, and it will print out the assembler code
sequence. I will dig it up later and post it. Here is a link to Agner's
document.
Agner's Pentopt PDF (http://www.agner.org/assem/)
Unrolling.
This is a guideline. Unrolling falls in the General Optimiation
category but I wanted to add a footnote. I always set up my Unrolling
with a macro that unrolls an EQUATE value amount. That way you can try
different values and see which is best easily. You want the unrolling
to fit in the L1 code cache ( or trace cache). Using an equate makes it
easy to try different unroll amounts to find the fastest one.
UNROLL_AMT equ 16 ; # of times to unroll the loop
UNROLL_NUM_BYTES equ 4 ; # of bytes handled in 1 loop iteration
mov ecx,1024
looper:
offset2 = 0
REPEAT UNROLL_AMT
add eax,[edi+offset2]
offset2 = offset2 + UNROLL_NUM_BYTES
add edi,UNROLL_AMT * UNROLL_NUM_BYTES ; we dealt with 16*4 bytes.
sub ecx,UNROLL_AMT ; subtract from loop counter the # of loops we unrolled.
jnz looper
MOVZX.
Use MOVZX to avoid partial register stalls. I use MOVZX a lot. A lot
of people XOR the full 32-bit register first. But MOVZX does the
equivalent thing without having to have an extra XOR instruction. Also
you had to do the XOR enough in advance to give it time to complete.
With MOVZX you don't have to worry about that.
Using MOVZX to avoid a SHIFT and AND instruction
I ran across this bit of C code I was trying to speed up using
assembler. The_array is a dword array. The code is trying to get a
different byte from a dword in the array passed upon which pass this is
over the data. "Pass" is a variable that goes from 0 to 3 for each byte
in a particular dword.
unsigned char c = ((the_array[i])>>(Pass<<3)) & 0xFF;
; I got rid of the "pass" variable by unrolling the loop 4 times.
; So I had 4 of these each one seperated by lots of C code.
unsigned char c = (the_array[i])>>0) & 0xFF;
unsigned char c = (the_array[i])>>8) & 0xFF;
unsigned char c = (the_array[i])>>16) & 0xFF;
unsigned char c = (the_array[i])>>24) & 0xFF;
What if I can get rid of the SHIFT and the AND using assembler?
That would save me 2 instructions. Not to mention the fact that the P4
is very slow when doing SHIFT instructions ( 4 cycles!!!). So try to
avoid shifts where possible. SO taking just the second to last line
that shifts right 16 as our example
; esi points to the_array
mov eax,[esi]
shr eax,16
and eax,0FFh
; So how do we change that to get rid of the AND and SHR?
; We do a MOVZx with the 3rd byte in the dword.
movzx eax,byte ptr [esi+2] ;unsigned char c = (the_array[i])>>16) & 0xFF;
Align, align, align.
It is really important to align both your code and data to get a
good speed up. I generally align code on 4 byte boundaries. For data I
align 2 byte data on 2 byte boundaries, 4 byte data on 4 byte
boundaries, 8 byte data on 8 byte boundaries, 16 byte data on 16 byte
boundaries. In general if you don't align your SSE or SSE2 data on a
16-byte boundary you will get an exception. You can align your data in
VC++ if you have the processor pack. They added support for both static
data and dynamic memory. For static data you use __declspec(align(4)) -
alignes on a 4 byte boundary.
BSR for powers of 2.
You can use BSR to count the highest power of 2 that goes into a variable.
XORing a register with itself to zero it.
This is an oldie, but I am including it anyway. It also has a side
benefit of clearing dependencies on the register. That is why sometimes
you will see people use XOR in that fashion, before doing a partial
register access. I prefer using MOVZX to doing it that way because it
is trickier to do using a XOR ( read my above comments about in #12
above talking about MOVZX) . On the P4 they also added support for PXOR
to break dependencies in that fashion. I think the P3 does the same
thing.
Use XOR and DIV.
If you know your data can be unsigned for a DIVISION, use XOR EDX, EDX, then DIV. It's faster than CDQ and IDIV.
Try to avoid obvious dependencies.
If you modify a register and then compare it to some value on the very
next line, instead try and put some other register modification in
between. Dependencies are any time you modify a register and then read
it or write it shortly afterwards.
inc edi
inc eax
cmp eax,1 ; this line has a dependency with the previous line, so it will stall.
jz fred
;shuffling the instructions around we can help break up dependencies.
inc eax
inc edi
cmp eax,1
jz fred
Instructions to avoid on P4.
On P4's try to avoid the following instructions, adc, sbb, rotate
instructions, shift instructions, inc, dec, lea, and any instruction
taking more than 4 uops. How do you know the processor running the code
is a P4? CPUID.
Using lookup tables.
On the P4 sometimes you can get around the long latency instructions
that I listed previously by doing lookup tables. Thankfully on P4's
they come with really fast memory. So having to do a lookup table
doesn't hurt performance as much if it isn't in the cache.
Use pointers instead of calculating indexes.
A lot of times in loops in C there will be multiplications by
non-powers of 2 numbers. You can easily get around this by adding
instead. Here is an example that uses a structure.
typedef struct fred
{
int fred;
char bif;
} freddy_type;
freddy_type charmin[80];
The size of freddy_type is 5 bytes. If you try and access them in a
loop the compiler will generate code for multipling by 5 for each array
access!!!! (Ewwwwwwwwwwwww). So how do we do it properly?
for ( int t = 0; t < 80; t++)
{
charmin[t].fred = rand(); // the compiler multiplies by 5 to get the offset, EWWWWWWWW!
charmin[t].bif = (char)(rand() % 256);
}
; in assembler we start with an offset of 0, that points to the first data item.
; And then we add 5 to it each loop iteration to avoid the MUL.
mov esi,offset charmin
mov ecx,80
fred_loop:
;... perform operations on the FRED and BIF elements in freddy_type
add esi,5 ;make it point to the next structure entry.
dec ecx
jnz fred_loop
The MUL removal applies to loops as well. I have seen people do
multiplies in loops as part of incrementing the variable or for
terminating condition. So try doing addition instead.
Conform to default branch predictions.
Try to set up your code such that backward conditional jumps are
usually taken, and forward conditional loops are almost never taken.
That has to do with branch prediction. The static branch predictor uses
that simple rule to guess if a conditional jump is taken or not. So
have a loop that has a backwards conditional jump at the end. And then
have special exit conditions from that same loop that executes a
forward jump that only exits on a certain condition that doesn't often
occur.
Eliminate branches
Eliminate branch where possible. This might seem obvious, but I have
seen some people use too many branches in their assembler code. Keep it
simple. Use as few branches as possible.
Using CMOVcc to remove branches
I have yet to see the CMOVcc instructions actually be faster than a
conditional jump. So I recommend using conditional jumps over CMOVcc.
It might be faster in the case where your jumps aren't easily guessable
by the branch prediction logic. So if that is the case with you,
benchmark it and see.
Local vs. Global variables
Use local variables for a procedure over using a global variable. If you use local variables you'll get less cache misses.
Address Calculation
Compute address calculations before you need them. Let's say you
have to do some funky stuff to get to a particular address. Such as
multiplying by 20. You can pre-compute that before you get to the point
in the code where you need it.
Smaller registers
Sometimes using smaller registers will give you a speed up. I did
this on the radix40 code. If you change the below code to use EDX it
runs slightly slower.
movzx edx,byte ptr [esi] ;get the data from the ascii array
test dl,ILLEGAL ;bit 7 set? if so do ILLEGAL handling
jnz skip_illegal
Instruction Length
Try and keep your instructions to 8 bytes or less.
Use registers to pass parameters
Try passing parameters in registers instead of on the stack where
possible. If you have 3 variables that you have to push onto the stack
as paramaters, that is at least 6 memory reads and 3 memory writes. You
have to read each variable from memory into a CPU register and then
push it on the stack. That is 3 memory reads right there. Then the 3
pushes onto the stack make 3 writes. Then why would you push parameters
you'd never use? So figure at least 3 ( maybe more) reads from the
stack of the pushed data.
Don't pass big data on the stack
Don't pass 64-bit data or 128-bit data ( or bigger) on the stack. Instead pass a pointer to the data.
Intermediate
Adding to memory faster than adding memory to a register
This has to do with the number of micro-ops the instruction takes.
Give preference to doing an add with memory over adding memory to a
register.
add eax,[edi] ;don't do this if possible
add [edi],eax ;This is preferred
Instruction selection
Try and pick instructions with the fewest micro-ops and shortest latencies.
Handling an unaligned byte data stream that needs to be dword aligned
Parsing a byte array a dword at a time will get you performance hits
due to the buffer not being aligned to a 4 byte boundary. You can get
around this by dealing with the first X bytes ( 0 to 3), until you come
to an aligned to a 4 byte boundary.
Using CMOVcc to reset an infinite loop pointer
If you are making multiple passes through an array, and want to
reset it to the beginning when you have reached the end of the array,
you can use CMOVcc.
dec ecx ; decrement index into array
cmovz ecx,MAX_COUNT ; if we are at the beginning, then reset the index
; to MAX_COUNT (the end).
Multiplying by 0.5 by doing a subtraction
This probably won't work for everything, but multiplying by 0.5, or
dividing by 2.0 in real4 ( in floating point), you can just subtract 1
from the exponent. Won't work with 0.0. For real8, the subtract value
is 00100000h ( donkey posted this)
.data
somefp real4 2.5
.code
sub dword ptr [somefp],00800000h ;divide real4 by 2.
Self Modifying Code
The P4 optimization manual recommends avoiding self-modifying code.
I have seen cases where it can run faster. But as always you need to
time it to verify in your case that it is faster.
MMX, SSE, SSE2
Most compilers don't generate good code for MMX, SSE and SSE2. GCC
and the Intel compiler have gotten a lot better at it. But hand tooled
assembler is still a big win in this area.
Using EMMS.
EMMS tends to be a really slow instruction on Intel processors. On AMD
it is faster. Generally I don't do it on a per routine basis, because
it is so slow. I very rarely use a lot of floating point in a program
that I already have a lot of MMX in (and vice versa). So I usually wait
to do the EMMS before doing any floating point. If you have a lot of
floating point and very little MMX, then do the EMMS at the end of all
the MMX routines you call (if you call any). But adding it in every
routine that does MMX just makes the code run slow.
Converting to MMX,SSE, or SSE2
Can your code be convert to MMX, SSE, or SSE2? If so you can get a big speed up by doing stuff in parallel.
Prefetching data.
This is underutilized a lot. If you are processing a huge array ( 256KB
and up), using the "prefetch" instruction on P3 and up processors can
speed up your code anywhere from 10-30%. You can actually get a
degradation in performance if you don't use it right. Unrolling works
well with this, because I unroll to the number of bytes fetched with
this instruction. On a P3 it is 32, but on a P4 it is 128. That means
you can easily unroll your loops to handle 128 bytes at a time on a P4
and get the benfit from unrolling and prefetching. It is not always the
case that if you unroll it for 128 bytes that you will get the best
speed up. So try different variations.
UNROLL_NUM_BYTES equ 4 ; # of bytes handled in one
; iteration of the loop.
UNROLL_AMT equ 128/UNROLL_NUM_BYTES ; We want to unroll the loop such
; that we handle 128 bytes per loop.
mov ecx,1024
looper:
offset2 = 0
REPEAT UNROLL_AMT
prefetchnta [edi+offset2+128] ; prefetch 128 bytes into the L1 cache before we need it.
add eax,[edi+offset2]
offset2 = offset2 + UNROLL_NUM_BYTES
add edi,UNROLL_AMT * UNROLL_NUM_BYTES ;we dealt with 16*4 bytes.
sub ecx,UNROLL_AMT ; subtract from loop counter the # of loops we unrolled.
jnz looper
Cache Blocking
Let's say you have to call multiple procedures on this big array in
memory. It is better to break it up into blocks that fit into the cache
to reduce cache misses. For example if you were doing 3D code, the
first procedure might translate your coordinates, the second might
scale and the third might rotate. So instead of going through the whole
huge array in one fell swoop. You break off a "chunk" of the data that
fits into the cache, and then call all 3 procedures, and then go to the
next chunk and repeat.
TLB Priming
The TLB is the Translation Lookaside Buffer. The TLB is cache that is
used to improve performance of the translation of a virtual memory
address to a physical memory address by providing fast access to page
table entries. Having it not in the TLB cache forces a cache misse
which slows the code down. The trick is to pre-read a data byte from
the next page before you have to read it. I will show an example later
on in another one of the tips.
Intermix your code to break dependencies.
In C code the C compiler treats different chunks of code as
seperate. To break dependencies, when you get to the assembler level
you can intermix them
Parallelization.
Most Compilers don't take advantage of the fact you have 2 pipelines
for ALU stuff, which is the majority of what people use. On the P4 you
have it even sweeter. You can execute 4 ALU instructions in 1 cycle if
you do it right. If you break things up into doing stuff in parallel it
also helps break up dependencies. So you kill two birds with one stone.
Assume this piece is in a loop.
looper:
mov eax,[esi]
xor eax,0E5h ;dependency with the line above it.
add [edi],eax ;dependency with the line above it.
add esi,4
add edi,4
dec ecx
jnz looper
;So how do we 'parallelize' it and reduce dependencies?
looper:
mov eax,[esi]
mov ebx,[esi+4]
xor eax,0E5
xor ebx,0E5
add [edi],eax
add [edi+4],ebx
add esi,8
add edi,8
sub ecx,2
jnz looper
Avoiding memory accesses
Re-structing the code to avoid memory accesses ( or other I/O). One
method is to accumulate a value in a register before writing it to
memory. Here is an example of that below. In this example assuming we
are adding 3 bytes from the source array to the destination array which
is an array of dwords. The destination array is zeroed.
We can accumulate the result in a register, and then only do one write to memory.
mov ecx,AMT_TO_LOOP
looper:
xor edx,edx ;zero out register to accumulate the result in.
movzx byte ptr eax,[esi]
add edx,eax
movzx byte ptr eax,[esi+1]
add edx,eax
movzx byte ptr eax,[esi+3]
add edx,eax
add esi,3
mov [edi],edx
add edi,4
dec ecx
jnz looper
When to convert a call to a jump
if the last statement in a routine is a call consider converting it to a jump to get rid of one call/ret.
Using arrays for data structures
(This is non-assembler related, but it's a great one). You can use
an array for data structures such as trees and linked lists. By using
an array the memory ends up being contiguous and you get a speed up due
to less cache misses.
Advanced
Avoid prefixes
Try to avoid prefixes ( prefixes get generated for a number of
things including segment overrides, branch hints, operand-size
override, address-size override, LOCKs, and REPs). Prefixes make your
instructions longer.
Grouping reads/writes in code
If there is a bunch of alternating between read and write
transactions on the bus, look at grouping and doing more reads at a
time and more writes at a time. Here is what we are trying to avoid:
mov eax,[esi]
mov [edi],eax
mov eax,[esi+4]
mov [edi+4],eax
mov eax,[esi+8]
mov [edi+8],eax
mov eax,[esi+12]
mov [edi+12],eax
;Grouping the reads and writes together this gets converted to
mov eax,[esi]
mov ebx,[esi+4]
mov ecx,[esi+8]
mov edx,[esi+12]
mov [edi],eax
mov [edi+4],ebx
mov [edi+8],ecx
mov [edi+12],edx
Making use of execution units to make your code run faster
Choose instructions that execute on different execution units. If
you do this properly the time to execute the code will be the
throughput time and not the latency time. For most instructions the
throughput time is less.
Interleaving 2 loops out of sync
You can unroll a loop twice, and instead of running each instruction
after each other, you can run them out of sync. Why is this useful? 2
reasons. First, sometimes you have instructions that have to use a
certain register and have a long latency such as MUL or DIV. That
creates a dependency on EDX:EAX for the two MUL instructions in a row.
Second, sometimes some instructions just have really long latencies. So
you want to try and place a number of instructions after it from the
other loop to help delay until it returns the result. A lot of MMX,
SSE, and SSE2 instructions on the P4 fall into that category. Here is
an example loop:
A1 ; instruction 1 loop 1
D2 ; instruction 4 loop 2
B1 ; instruction 2 loop 1 A2 ; instruction 1 loop 2
C1 ; instruction 3 loop 1
B2 ; instruction 2 loop 2
D1 ; instruction 4 loop 1
C2 ; instruction 3 loop 2
Different tricks you can do with masks created by comparison instructions using MMX/SSE/SSE2.
With MMX and SSE and SSE2 you can generate masks when doing
comparisons. This can be helpful in some cases when looking for a
pattern in a file, such as a line feed. So you can use it to search for
patterns, not just math operations.
You can use MMX, SSE, and SSE2 masks that get generated by a compare
instruction to control doing math on only part of a MMX or SSE
register. The following piece of code only adds a 9 to the dword parts
of an MMX register if it has a 5 in it.
; if (fredvariable == 5)
; fredvariable += 9;
movq mm5,[two_fives] ;mm5 has 2 DWORD 5's in it.
movq mm6,[two_nines] ;mm6 has 2 DWORD 9's in it.
movq mm0,[array_in_memory] ;get value
movq mm1,mm0 ;get backup copy
pcmpeqd mm1,mm5 ;mm1 now has an FFFFFFFF in each dword location
; in MM1 with a 5 all other locations have 0.
pand mm1,mm6 ;zero out the locations in MM6 that don't
; have 5's in them for MM0.
paddd mm0,mm1 ;add 9 ONLY to the locations in MM0 that
; have a 5 in them.
PSHUFD and PSHUFW.
On the P4 MMX, SSE and SSE2 move instructionns are slow. You can get
around this by doind "pshufd" for the SSE and SSE2 and "pshufw" for
MMX. It is 2 cycles faster. There is one caveat. It has to do with what
pipeline the opcode goes down. So without getting too technical,
sometimes it is faster to use the slower "MOVDQA" than to replace it
with a "PSHUFD". So time your code.
pshufd xmm0,[edi],0E4h ;copy 16 bytes at location EDI into XMM0.
; The 0E4h makes it a straight copy.
pshufw mm0,[edi],0E4h ;copy 8 bytes at location EDI into MM0.
; The 0E4h makes it a straight copy.
Write directly to memory - bypass the cache.
Another optimization dealing with memory. If you have to write to a lot
of memory (256KB and up) it is faster to write directly to memory
bypassing the cache. If you have a P3 you can use "movntq" or
"movntps". The first does an 8 byte write, the second a 16-byte. The
16-byte write needs to be 16-byte aligned. On the P4 you also get
"movntdq", which does 16-bytes also, but needs to be 16-byte aligned.
This trick applies to both memory fills and memory copies. Both do a
write operation. Here is some sample code. I personally would do 8 XMM
registers in parallel to help break up some of the latencies for the P4
MOVDQA instruction. However to help understanding, I did not do that.
mov ecx,16384 ;write 16384 16-byte values, 16384*16 = 256KB.
; So we are copying a 256KB array
mov esi,offset src_arr ;pointer to the source array which has to be
; 16-byte aligned or you will get an exception.
mov edi,offset dst_arr ;pointer to the destination array which has to be
; 16-byte aligned or you will get an exception.
looper:
movdqa xmm0,[esi] ;works on P3 and up
movntps [edi],xmm0 ;Works on P3 and up
add esi,16
add edi,16
dec ecx
jnz looper
Handle 2 cases per loop for MMX/SSE/SSE2.
On the P4 the latencies are usually so long on the MMX, SSE and SSE2
instructions that I always handle 2 cases per loop, or read a loop in
advance. Or more than 2 cases if I have enough registers. All the
various MOVE ( including MOVD) instructions on P4 are slow. So adding 2
32-bit arrays of numbers together is going to be slower on the P4 than
the P3. A faster way would be to do two per loop, where you pre-read
the loops initial values MM0 and MM1 before the FRED label. You just
have to have special handling if you have an odd number of array
elements. Just check that at the end, and if so add code for that one
extra dword. Here is the code that does not read a value in advance. I
think converting this to read a value in advance is easy. So that is
why I am not posting both. This is how you would avoid ADC on a P4,
which is a slow instruction, to add two arrays together.
pxor mm7,mm7 ; the previous loops carry stays in here.
fred:
movd mm0,[esi] ; esi points to src1
movd mm1,[edi] ; edi points to src2, also to be used as the destination.
paddq mm0,mm1 ; add both values together
paddq mm0,mm7 ; add in remainder from last add.
movd [edi],mm0 ; save value to memory
movq mm7,mm0
psrlq mm7,32 ; shift the carry over to bit 0.
add esi,8
add edi,8
sub ecx,1
jnz fred
movd [edi],mm7 ; save carry
Pre-reading MMX or XMM register to get around long latency
Pre-reading an SSE2 register before you need it will give a speed
up. That is because MOVDQA takes 6 cycles on a P4. That is really slow.
So because it has such a long latency, I want to read it in advance of
where I use it to make sure it doens't stall. Here is an example.
movdqa xmm1,[edi+16] ;read in XMM1 before we use it, takes 6 cycles on
P4, not including the time to get it from cache. por xmm5,xmm0 ;do an
OR with XMM0 which was previously read. Takes 2 cycles on the P4. pand
xmm6,xmm0 ;do an AND with XMM0 which was previously read. Takes 2
cycles on the P4. movdqa xmm2,[edi+32] ;read in XMM2 before we use it,
takes 6 cycles on P4, not including the time to get it from cache. por
xmm5,xmm1 ;do an OR with XMM1 which was previously read. Takes 2 cycles
on the P4. pand xmm6,xmm1 ;do an AND with XMM1 which was previously
read. Takes 2 cycles on the P4.
Accumulating a result in a register or registers to avoid doing a slow instruction
Accumulating a result in a register or registers to avoid doing a
slow instruction. I did this to speed up a compare/read loop written in
SSE2. The slow instruction was PMOVMSKB. So instead of executing it
every loop, I accumulated a result in a register. And then every 4KB of
memory read, I would do a PMOVMSKB. This gave a good speed up. The
example below will also demonstrate using PREFETCH and TLB Priming.
Their are 2 loops in the below code. The inner loop is unrolled to 128
bytes( the number of bytes fetched by PREFETCH on a P4). The outer loop
is unrolled to 4KB, so that it can do TLB Priming. If you are using a
system that doesn't use a 4KB page size, you'd have to do modify the
code appropriately. On the system I tested this on ( a Dell Server)
with 6.4 GB/s of max memory bandwidth, I was able to do a read and a
compare at 5.55 GB/s ( in a non-Windows environment. Under windows it
will run slower). I left out the code that at label "compare_failed",
for 2 reasons. 1) The cut/paste of code is already big. 2) It doesn't
demosntrate any techniques I want to show. The code at "compare_failed"
simply does a REP SCASD to find the failing address after PCMPEQD finds
it to the nearest 4KB block. This one has a HUGE code example, so I put
it last in case you fall asleep reading it ;)
read_compare_pattern_sse2 proc near
mov edi,[start_addr] ;Starting Address
mov ecx,[stop_addr] ;Last addr to NOT test.
mov ebx,0FFFFFFFFh ;AND mask
movd xmm6,ebx ;AND mask
pshufd xmm6,xmm6,00000000b ;AND mask
movdqa xmm0,[edi] ;Get first 16 bytes
mov eax,[pattern] ;EAX holds pattern
pxor xmm5,xmm5 ;OR mask
movd xmm7,eax ;Copy EAX to XMM7
pshufd xmm7,xmm7,00000000b ;Blast to all DWORDS
outer_loop:
mov ebx,32 ;128 32 byte blocks
mov esi,edi ;save start of block
if DO_TLB_PRIMING
mov eax,[edi+4096] ;TLB priming
endif ; if DO_TLB_PRIMING
fred_loop:
movdqa xmm1,[edi+16] ;read 16 bytes
por xmm5,xmm0 ;OR into mask
pand xmm6,xmm0 ;AND into mask
movdqa xmm2,[edi+32] ;read 16 bytes
por xmm5,xmm1 ;OR into mask
pand xmm6,xmm1 ;AND into mask
movdqa xmm3,[edi+48] ;read 16 bytes
por xmm5,xmm2 ;OR into mask
pand xmm6,xmm2 ;AND into mask
movdqa xmm0,[edi+64] ;read 16 bytes
por xmm5,xmm3 ;OR into mask
pand xmm6,xmm3 ;AND into mask
movdqa xmm1,[edi+80] ;read 16 bytes
por xmm5,xmm0 ;OR into mask
pand xmm6,xmm0 ;AND into mask
movdqa xmm2,[edi+96] ;read 16 bytes
por xmm5,xmm1 ;OR into mask
pand xmm6,xmm1 ;AND into mask
movdqa xmm3,[edi+112] ;read 16 bytes
por xmm5,xmm2 ;OR into mask
pand xmm6,xmm2 ;AND into mask
por xmm5,xmm3 ;OR into mask
prefetchnta [edi+928] ;Prefetch 928 ahead
pand xmm6,xmm3 ;AND into mask
add edi,128 ;Go next 128byteblock
cmp edi,ecx ;At end?
jae do_compare ;No, jump
movdqa xmm0,[edi] ;read 16 bytes
sub ebx,1 ;Incr for inner loop
jnz fred_loop
do_compare:
pcmpeqd xmm5,xmm7 ;Equal?
pmovmskb eax,xmm5 ;Grab high bits in EAX
cmp eax,0FFFFh ;all set?
jne compare_failed ;No, exit failure
mov edx,0FFFFFFFFh ;AND mask
pxor xmm5,xmm5
pcmpeqd xmm6,xmm7 ;Equal?
pmovmskb eax,xmm6 ;Grab high bits in EAX
cmp eax,0FFFFh ;All Set?
jne compare_failed ;No, exit failure
movd xmm6,edx ;AND mask
pshufd xmm6,xmm6,00000000b ;AND mask
cmp edi,ecx ;We at end of range
jb outer_loop ;No, loop back up
jmp compare_passed ;Done!!! Success!!!
Prefetch distance and location within the loop
You will notice in the previous example that I prefetch 928 bytes
ahead instead of 128, when 128 is the number of bytes fetched on a P4.
Why is that? Well Intel recmomends prefetching 128 bytes ( 2 cache
lines) ahead at the start of your loop. I found both statements ( doing
at the beginning of the loop and prefetching 128 bytes ahead) to be
wrong. I don't prefetch at the beginning of the loop nor do I prefetch
128 bytes ahead. Why? Well when I was playing with the code I found
that I could make it run faster by moving the PREFETCH instruction
around in the loop and changing the offest for how far ahead it
prefetches. So being the geek I am I wrote code to try all combinations
of the location of the prefetch instruction in the loop and offsets to
begin prefetching. The code takes an assembler file and moves the
"prefetch" instruction around inside a loop, and also modifies the
offset to begin prefetching. Then a batch file compiles the just
modified code, and runs a benchmark. I ran the benchmark over several
hours to try the different combinations ( I started at a prefetch
distance of 32, and went up to 1024, in increments of 32). On the
system I wrote the code on 928 ahead instead of 128 was the fastest.
And prefetching almost at the end of the loop was fastest ( the
prefetchnta instruction is about 8 lines above the do_compare: label)