Here's another of my recent projects - multiplication of two 4 X 4 matrices of floats. I've seen that another member here has posted some work that multiplies a matrix and a vector, so possibly there will be some interest in what I've done here.

By way of review if this isn't fresh in your mind, if A and B are square matrices, the entries of the product AB are gotten by "dotting" each row of A with each column of B. With A and B each being 4 X 4 matrices, there will be 16 dot products to compute, which is where the SSE4.1 DPPS instruction comes in. It multiplies two 128-bit quantities, and produces a 32-bit result. The third parameter of DPPS took me a while to figure out. This 8-bit immediate value controls which pairs of floats contribute to the dot product, as well as where the 32-bit product is placed in the 128-bit result register.

To make things simpler for me, if you first take the transpose of B (denoted as B^T), then you only need to dot each row of A with each row of B transpose. Here is a listing of the part of my code that shows the dot product of row 0 of A with the four rows of B transpose. The other three parts of the code that dot rows 1, 2, and 3 of A with the rows of B transpose are nearly identical.

I have attached the source code, which is a Visual Studio 10 C program with inline assembly code. You need a fairly new CPU to run the code, one that supports SSE 4.1.

`// Calculate first row of matrix product, and store in row 0 of AB. `

__asm{

movaps xmm4, Arow0

movaps xmm0, BTrow0

movaps xmm1, BTrow1

movaps xmm2, BTrow2

movaps xmm3, BTrow3

// Calculate dot products of each row of B with row 0 of A.

dpps xmm0, xmm4, 0xF1

dpps xmm1, xmm4, 0xF2

dpps xmm2, xmm4, 0xF4

dpps xmm3, xmm4, 0xF8

// Combine all dot products into a single float.

addps xmm0, xmm1

addps xmm0, xmm2

addps xmm0, xmm3

lea eax, temp

movaps xmmword ptr[eax], xmm0

}

// Convert xmmword to an array and store in row 0 of AB.

convertXmmToSPArray((float *)AB, temp);