4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction

Mark44 · February 22, 2014, 03:35:24 AM

Here's another of my recent projects - multiplication of two 4 X 4 matrices of floats. I've seen that another member here has posted some work that multiplies a matrix and a vector, so possibly there will be some interest in what I've done here.

By way of review, in case this isn't fresh in your mind: if A and B are square matrices, the entries of the product AB are obtained by "dotting" each row of A with each column of B. With A and B each being 4 X 4 matrices, there are 16 dot products to compute, which is where the SSE4.1 DPPS instruction comes in. It multiplies the four pairs of floats held in two 128-bit operands and sums the selected products into a single 32-bit float. The third parameter of DPPS took me a while to figure out. This 8-bit immediate value controls which pairs of floats contribute to the dot product (the high four bits), as well as where the 32-bit result is placed in the 128-bit result register (the low four bits). Lanes of the result that are not selected are set to zero.
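To make the role of the immediate byte concrete, here is a minimal scalar sketch of what DPPS computes. The helper name dpps_model is hypothetical and is not part of the attached source; it only models the lane selection, and the hardware may add the products in a different order, so rounding can differ for general inputs.

```c
#include <stdint.h>

/* Scalar model of "dpps dst, src, imm8" (hypothetical helper for illustration).
   Bits 4-7 of imm8 select which of the four products a[i] * b[i] enter the sum.
   Bits 0-3 of imm8 select which lanes of out receive that sum.
   Output lanes that are not selected are set to 0.0f. */
static void dpps_model(const float a[4], const float b[4],
                       uint8_t imm8, float out[4])
{
    float sum = 0.0f;

    /* High nibble: choose the products that contribute. */
    for (int i = 0; i < 4; i++) {
        if (imm8 & (0x10u << i))
            sum += a[i] * b[i];
    }

    /* Low nibble: choose the lanes that receive the result. */
    for (int i = 0; i < 4; i++)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

With the immediates 0xF1, 0xF2, 0xF4 and 0xF8, all four products are summed each time, and the result lands in lane 0, 1, 2 and 3 respectively, with zeros elsewhere.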

To make things simpler for me, if you first take the transpose of B (denoted as B^T), then you only need to dot each row of A with each row of B transpose. Here is a listing of the part of my code that shows the dot product of row 0 of A with the four rows of B transpose. The other three parts of the code that dot rows 1, 2, and 3 of A with the rows of B transpose are nearly identical.

I have attached the source code, which is a Visual Studio 10 C program with inline assembly code. You need a fairly new CPU to run the code, one that supports SSE 4.1.

Code:

// Calculate first row of matrix product, and store in row 0 of AB. 

  __asm{
	movaps xmm4, Arow0
	movaps xmm0, BTrow0
	movaps xmm1, BTrow1
	movaps xmm2, BTrow2
	movaps xmm3, BTrow3

	// Calculate dot products of each row of B transpose with row 0 of A.
	dpps xmm0, xmm4, 0xF1
	dpps xmm1, xmm4, 0xF2
	dpps xmm2, xmm4, 0xF4
	dpps xmm3, xmm4, 0xF8

	// Combine the four dot products into one register. Each occupies a
	// different lane (the others are zero), so the adds simply merge them.
	addps xmm0, xmm1
	addps xmm0, xmm2
	addps xmm0, xmm3
	lea eax, temp
	movaps xmmword ptr[eax], xmm0
  }
// Convert xmmword to an array and store in row 0 of AB.
convertXmmToSPArray((float *)AB, temp);
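For checking the output of the SSE version, a plain C reference that follows the same scheme (transpose B, then dot rows with rows) can be handy. This is only a sketch: the function names transpose4 and matmul4_via_bt are hypothetical, and the matrices are assumed to be stored row-major as arrays of 16 floats, which may not match the layout in the attached source.

```c
/* Plain C reference for a 4 x 4 product computed through the transpose of B.
   Matrices are row-major arrays of 16 floats. Function names are
   hypothetical and do not come from the attached source. */

/* Write the transpose of m into t. */
static void transpose4(const float m[16], float t[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            t[c * 4 + r] = m[r * 4 + c];
}

/* Compute ab = a * b by dotting row i of a with row j of b transpose. */
static void matmul4_via_bt(const float a[16], const float b[16], float ab[16])
{
    float bt[16];
    transpose4(b, bt);

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float dot = 0.0f;
            for (int k = 0; k < 4; k++)
                dot += a[i * 4 + k] * bt[j * 4 + k];
            ab[i * 4 + j] = dot;
        }
    }
}
```

Row j of the transpose is column j of B, so both operands of each dot product are contiguous in memory, which is what lets a single 128-bit load feed DPPS.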

Gunther · February 22, 2014, 04:14:45 AM

Mark,

thank you for providing that code. I couldn't test it, because gcc doesn't compile it. There are some VS-specific constructs and a different inline assembler.

Gunther

Mark44 · February 22, 2014, 05:19:48 AM

Thanks for taking a look, Gunther. I'll look into separating out the assembly code to its own file, but it will take me some time to do that. Apart from the inline assembler, which are the VS-specific constructs that gcc has trouble with? Is it the __m128 type I'm using?

Gunther · February 22, 2014, 05:58:03 AM

Mark,

No, the __m128 data type isn't a major problem, because gcc has a similar intrinsic type. But your __declspec is very MS-specific and not portable.

Gunther

Mark44 · February 22, 2014, 06:20:17 AM

OK, that's an easy fix. The reason for __declspec in the array declarations is to align them on 16-byte boundaries. I believe you can remove those __declspec attributes (don't know what else to call them), and change the movaps instructions to movups (i.e., aligned move to unaligned move). If you just remove the __declspec things but don't change the movaps instructions, I seem to recall that you get a runtime fault, since movaps requires its memory operand to be 16-byte aligned.

TWell · February 22, 2014, 06:32:48 AM

From Google:

Code:

#if defined(_MSC_VER)
    #define __align(_boundary_size) __declspec(align(_boundary_size))
#else
    #define __align(_boundary_size) __attribute__((aligned(_boundary_size)))
#endif
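As a rough illustration of how such a macro might be applied, the sketch below declares a 16-byte aligned array and checks the address at run time. The names ALIGN16, example_row and is_aligned16 are hypothetical and chosen for this example only; the array stands in for whatever row storage the real program uses.

```c
#include <stdint.h>

/* ALIGN16 is a hypothetical macro name chosen for this sketch. */
#if defined(_MSC_VER)
    #define ALIGN16 __declspec(align(16))
#else
    #define ALIGN16 __attribute__((aligned(16)))
#endif

/* Example storage for one matrix row, placed on a 16-byte boundary. */
static ALIGN16 float example_row[4];

/* Returns 1 if p lies on a 16-byte boundary, which is what movaps
   requires of a memory operand, and 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A check like is_aligned16(example_row) before an aligned move is a cheap way to confirm that the declaration had the intended effect on a given compiler.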

Gunther · February 22, 2014, 10:35:04 AM

Mark,

I know the reason for the 16-byte alignment, and TWell made a good proposal. If I find enough time, I'll make the necessary changes for compiling your sources with gcc.

Gunther

Farabi · February 26, 2014, 10:23:45 PM

Cool. I never thought it could be done in such short code. It might be another 3 years before I can afford to buy a new laptop.

Gunther · February 26, 2014, 10:39:35 PM

I hope so. It's a fascinating new technology.

Gunther
