Here's another of my recent projects - multiplication of two 4 X 4 matrices of floats. I've seen that another member here has posted some work that multiplies a matrix and a vector, so possibly there will be some interest in what I've done here.
By way of review if this isn't fresh in your mind, if A and B are square matrices, the entries of the product AB are gotten by "dotting" each row of A with each column of B. With A and B each being 4 X 4 matrices, there will be 16 dot products to compute, which is where the SSE4.1 DPPS instruction comes in. It multiplies two 128-bit quantities, and produces a 32-bit result. The third parameter of DPPS took me a while to figure out. This 8-bit immediate value controls which pairs of floats contribute to the dot product, as well as where the 32-bit product is placed in the 128-bit result register.
To make things simpler for me, if you first take the transpose of B (denoted as B^T), then you only need to dot each row of A with each row of B transpose. Here is a listing of the part of my code that shows the dot product of row 0 of A with the four rows of B transpose. The other three parts of the code that dot rows 1, 2, and 3 of A with the rows of B transpose are nearly identical.
I have attached the source code, which is a Visual Studio 10 C program with inline assembly code. You need a fairly new CPU to run the code, one that supports SSE 4.1.
// Calculate first row of matrix product, and store in row 0 of AB.
__asm{
movaps xmm4, Arow0
movaps xmm0, BTrow0
movaps xmm1, BTrow1
movaps xmm2, BTrow2
movaps xmm3, BTrow3
// Calculate dot products of each row of B with row 0 of A.
dpps xmm0, xmm4, 0xF1
dpps xmm1, xmm4, 0xF2
dpps xmm2, xmm4, 0xF4
dpps xmm3, xmm4, 0xF8
// Combine all dot products into a single float.
addps xmm0, xmm1
addps xmm0, xmm2
addps xmm0, xmm3
lea eax, temp
movaps xmmword ptr[eax], xmm0
}
// Convert xmmword to an array and store in row 0 of AB.
convertXmmToSPArray((float *)AB, temp);