Author Topic: 4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction  (Read 4165 times)

Mark44

  • Regular Member
  • *
  • Posts: 40
4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction
« on: February 22, 2014, 03:35:24 AM »
Here's another of my recent projects - multiplication of two 4 X 4 matrices of floats. I've seen that another member here has posted some work that multiplies a matrix and a vector, so possibly there will be some interest in what I've done here.

By way of review if this isn't fresh in your mind, if A and B are square matrices, the entries of the product AB are gotten by "dotting" each row of A with each column of B. With A and B each being 4 X 4 matrices, there will be 16 dot products to compute, which is where the SSE4.1 DPPS instruction comes in. It multiplies the corresponding floats of two 128-bit operands (four floats each) and adds the selected products into a single 32-bit float. The third parameter of DPPS took me a while to figure out. The high four bits of this 8-bit immediate value control which pairs of floats contribute to the dot product, and the low four bits control which of the four 32-bit lanes of the 128-bit result register receive the result. The lanes that are not selected are set to zero.
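
A scalar C model can make the immediate byte concrete. This is only a sketch for illustration: the name dpps_model is made up, it is not part of the attached source, and it models the documented behavior of the instruction rather than executing it.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of "dpps dst, src, imm8" (illustration only).
   dst and src each hold four floats; the result overwrites dst. */
void dpps_model(float dst[4], const float src[4], uint8_t imm8)
{
    float prod[4];
    float sum;
    int i;

    assert(dst != 0 && src != 0);

    /* High nibble: bit (4 + i) enables the product of element i. */
    for (i = 0; i < 4; i++)
        prod[i] = (imm8 & (0x10 << i)) ? dst[i] * src[i] : 0.0f;

    /* Intel's pseudocode adds the products pairwise. */
    sum = (prod[0] + prod[1]) + (prod[2] + prod[3]);

    /* Low nibble: bit i writes the sum into lane i; other lanes become 0. */
    for (i = 0; i < 4; i++)
        dst[i] = (imm8 & (1 << i)) ? sum : 0.0f;
}
```

With this model, a mask of 0xF2 uses all four products and leaves the sum in lane 1 only, which is why the four masks 0xF1, 0xF2, 0xF4 and 0xF8 each fill a different lane.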

To make things simpler for me, if you first take the transpose of B (denoted as B^T), then you only need to dot each row of A with each row of B transpose. Here is a listing of the part of my code that shows the dot product of row 0 of A with the four rows of B transpose. The other three parts of the code that dot rows 1, 2, and 3 of A with the rows of B transpose are nearly identical.

I have attached the source code, which is a Visual Studio 10 C program with inline assembly code. You need a fairly new CPU to run the code, one that supports SSE4.1.
 
Code: [Select]
// Calculate first row of matrix product, and store in row 0 of AB.

  __asm{
movaps xmm4, Arow0
movaps xmm0, BTrow0
movaps xmm1, BTrow1
movaps xmm2, BTrow2
movaps xmm3, BTrow3

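// Calculate dot products of row 0 of A with each row of B transpose.
// The low nibble of each mask puts the result in a different lane
// and zeroes the other three lanes.
dpps xmm0, xmm4, 0xF1
dpps xmm1, xmm4, 0xF2
dpps xmm2, xmm4, 0xF4
dpps xmm3, xmm4, 0xF8

// Combine the four dot products into a single vector of four floats.
addps xmm0, xmm1
addps xmm0, xmm2
addps xmm0, xmm3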
lea eax, temp
movaps xmmword ptr[eax], xmm0
  }
// Convert xmmword to an array and store in row 0 of AB.
convertXmmToSPArray((float *)AB, temp);
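
For checking the output of the assembly version, a plain C reference that follows the same "row of A dot row of B transpose" idea might look like the sketch below. The function name matmul4_ref and the array layout (row-major float[4][4]) are assumptions made for this example; they are not taken from the attached source.

```c
#include <assert.h>

/* Reference 4 X 4 product: ab = a * b, all matrices row-major.
   Hypothetical helper for comparison, not the attached program. */
void matmul4_ref(float ab[4][4], float a[4][4], float b[4][4])
{
    float bt[4][4];
    int i, j, k;

    /* The output must not overlap the inputs. */
    assert(ab != a && ab != b);

    /* Build B transpose so that column j of B becomes row j of bt. */
    for (j = 0; j < 4; j++)
        for (k = 0; k < 4; k++)
            bt[j][k] = b[k][j];

    /* Entry (i, j) is row i of A dotted with row j of B transpose. */
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (k = 0; k < 4; k++)
                sum += a[i][k] * bt[j][k];
            ab[i][j] = sum;
        }
    }
}
```

The inner loop over k is the part that a single DPPS replaces, and the loop over j corresponds to the four DPPS instructions with masks 0xF1 through 0xF8. Results may differ from the SSE version in the last bits, because the additions happen in a different order.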


« Last Edit: February 27, 2014, 02:02:51 AM by Mark44 »

Gunther

  • Member
  • *****
  • Posts: 3515
  • Forgive your enemies, but never forget their names
Re: 4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction
« Reply #1 on: February 22, 2014, 04:14:45 AM »
Mark,

thank you for providing that code. I couldn't test it, because gcc doesn't compile it. There are some VS-specific constructs and a different inline assembler.

Gunther
Get your facts first, and then you can distort them.

Mark44

  • Regular Member
  • *
  • Posts: 40
Re: 4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction
« Reply #2 on: February 22, 2014, 05:19:48 AM »
Thanks for taking a look, Gunther. I'll look into separating out the assembly code to its own file, but it will take me some time to do that. Apart from the inline assembler, which are the VS-specific constructs that gcc has trouble with? Is it the __m128 type I'm using?

Gunther

  • Member
  • *****
  • Posts: 3515
  • Forgive your enemies, but never forget their names
Re: 4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction
« Reply #3 on: February 22, 2014, 05:58:03 AM »
Mark,

No, the __m128 data type isn't a major problem, because gcc has a similar intrinsic type. But your __declspec is very MS-specific and not portable.

Gunther

Mark44

  • Regular Member
  • *
  • Posts: 40
Re: 4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction
« Reply #4 on: February 22, 2014, 06:20:17 AM »
OK, that's an easy fix. The reason for __declspec in the array declarations is to align them on 16-byte boundaries. I believe you can remove those __declspec attributes (don't know what else to call them), and change the movaps instructions to movups (i.e., aligned move to unaligned move). If you just remove the __declspec things but don't change the movaps instructions, I seem to recall that you get alignment faults at runtime, because movaps requires its memory operand to be 16-byte aligned.

TWell

  • Member
  • ****
  • Posts: 748
Re: 4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction
« Reply #5 on: February 22, 2014, 06:32:48 AM »
from google:
Code: [Select]
#if defined(_MSC_VER)
    #define __align(_boundary_size) __declspec(align(_boundary_size))
#else
    #define __align(_boundary_size) __attribute__((aligned(_boundary_size)))
#endif
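
A usage sketch for a macro of this kind follows. The names ALIGN16, is_aligned16 and example_row are made up for this example and do not come from the attached source. A different macro name is used here because identifiers starting with two underscores are reserved for the compiler.

```c
#include <assert.h>   /* for the check shown in the usage note */
#include <stdint.h>

/* ALIGN16 is a hypothetical name chosen for this sketch. */
#if defined(_MSC_VER)
    #define ALIGN16 __declspec(align(16))
#else
    #define ALIGN16 __attribute__((aligned(16)))
#endif

/* Returns 1 if p sits on a 16-byte boundary, which movaps requires. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* One row of a 4 X 4 matrix, aligned so that movaps can load it. */
ALIGN16 float example_row[4] = {1.0f, 2.0f, 3.0f, 4.0f};
```

A debug check such as assert(is_aligned16(example_row)); before the assembly block would catch a misaligned array early instead of faulting inside movaps.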

Gunther

  • Member
  • *****
  • Posts: 3515
  • Forgive your enemies, but never forget their names
Re: 4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction
« Reply #6 on: February 22, 2014, 10:35:04 AM »
Mark,

I know the reason for the 16-byte alignment, and TWell made a good proposal. If I find enough time, I'll make the necessary changes for compiling your sources with gcc.

Gunther

Farabi

  • Member
  • ****
  • Posts: 970
  • Neuroscience Fans
Re: 4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction
« Reply #7 on: February 26, 2014, 10:23:45 PM »
Cool. I never thought it could be done in such short code. It might be another 3 years before I can afford to buy a new laptop.
http://farabidatacenter.url.ph/MySoftware/
My 3D Game Engine Demo.

Contact me at Whatsapp: 6283818314165

Gunther

  • Member
  • *****
  • Posts: 3515
  • Forgive your enemies, but never forget their names
Re: 4 X 4 Matrix Multiplication using SSE4.1 DPPS instruction
« Reply #8 on: February 26, 2014, 10:39:35 PM »
I hope so. It's a fascinating new technology.

Gunther