How to use the SSE 4.1 DPPS instruction (dot product packed single)

Mark44 · February 22, 2014, 06:13:32 AM

If you need to compute the dot product of two vectors of four single-precision floats in your application, the SSE 4.1 DPPS instruction comes in handy. Using it is fairly straightforward except for the third operand, an 8-bit immediate value. I will attempt to explain how this value works in this post.

Assuming that XMM0 and XMM1 have been properly initialized, here is an example:

dpps xmm0, xmm1, 0xF1

In this example, XMM0 is both the destination and first input, and XMM1 is the second input. The 8-bit immediate value is the subject of this post.

The immediate operand serves two purposes:

Bits 4 - 7 -- Which portions of the first two operands get multiplied.
Bits 0 - 3 -- Where the resulting product gets placed in the destination (first) operand.
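
To make the layout of the two 4-bit fields concrete, here is a small sketch in C. The helper name dpps_imm and its parameter names are invented for illustration; they are not part of any Intel or compiler header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from any Intel header): packs two 4-bit masks
   into a DPPS immediate.
   mul_mask -> bits 4-7: which of the four float pairs are multiplied
   dst_mask -> bits 0-3: which destination elements receive the result */
static uint8_t dpps_imm(unsigned mul_mask, unsigned dst_mask)
{
    assert(mul_mask <= 0xFu && dst_mask <= 0xFu);  /* each field is 4 bits */
    return (uint8_t)((mul_mask << 4) | dst_mask);
}
```

With this sketch, dpps_imm(0xF, 0x1) yields 0xF1, the value used in the example above. The instruction itself needs a constant encoded in the opcode, so in real code the immediate is written as a literal; the helper is only meant to show how the bits are arranged.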

Which portions
The high 4 bits of the immediate operand control which pairs of 32-bit elements get multiplied. The bit pattern 0001B picks the low 32 bits of each of the 128-bit operands. Similarly, the bit pattern 0010B picks bits 32 - 63 of the 128-bit operands. Bit patterns 0100B and 1000B pick bits 64 - 95 and 96 - 127, respectively.

Other bit patterns in the high 4 bits of the immediate pick multiple portions of the 128-bit operands, and the selected products are added together. For a normal four-element dot product, all multiplications should be performed, so the high 4 bits should be 1111B or 0xF.

Where the result is placed
The low 4 bits of the immediate operand control where the result (the sum of the selected products) gets placed in the destination. A bit pattern of 0001B causes the result to be placed in the lowest 32 bits of the destination. A bit pattern of 0010B causes the answer to be placed in bits 32 - 63 of the destination. Bit patterns of 0100B and 1000B cause the answer to be placed in bits 64 - 95 and bits 96 - 127, respectively.

Other bit patterns in the low 4 bits cause the answer to be broadcast to multiple portions of the destination. For example, the bit pattern 0011B (or 0x3) causes the same answer to be placed in the low 32 bits of the destination as well as the next higher 32 bits of the destination (i.e., in bits 32 - 63 of the destination). Any 32-bit portion of the destination whose bit is 0 is set to 0.0 rather than left unchanged.
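
The two halves of the immediate can be modeled in plain C. This is a sketch for understanding only, not how the hardware is implemented; the function name dpps_model is made up here. It adds the products pairwise, which is my reading of the operation described in Intel's documentation, and it zeroes every destination element whose bit in the low nibble is clear.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference model of DPPS (illustrative name, not a real API).
   dst may alias a, as in the SSE form where the first operand is also
   the destination. */
static void dpps_model(float dst[4], const float a[4], const float b[4],
                       uint8_t imm8)
{
    float t[4], sum;
    int i;

    assert(dst && a && b);

    /* Bits 4-7: product i takes part only if bit (4 + i) is set. */
    for (i = 0; i < 4; i++)
        t[i] = (imm8 & (1u << (4 + i))) ? a[i] * b[i] : 0.0f;

    /* Pairwise addition of the four (possibly zeroed) products. */
    sum = (t[0] + t[1]) + (t[2] + t[3]);

    /* Bits 0-3: element i receives the sum if bit i is set, else 0.0. */
    for (i = 0; i < 4; i++)
        dst[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

For a = {1, 1, 1, 1} and b = {1, 2, 4, 8}, the immediate 0xF1 gives {15, 0, 0, 0}, while 0x33 multiplies only the two low pairs and broadcasts their sum, giving {3, 3, 0, 0}.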

The SSE version of the DPPS instruction is destructive, in the sense that one of the XMM registers gets overwritten. The AVX version of this instruction, VDPPS, like many of the AVX counterparts to the SSE instructions, takes an additional operand so as to be nondestructive. VDPPS can operate on the 128-bit XMM registers or on the 256-bit YMM registers; in the 256-bit form it computes a separate dot product in each 128-bit half rather than a single eight-element dot product.

Here is my complete code, which I compiled with VS 10 on an Intel i7 CPU. I realize that the GNU compiler might choke on the inline assembly - apologies in advance.

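#include <stdio.h>
#include <immintrin.h>

int main(int argc, char* argv[])
{
	// movaps requires 16-byte aligned memory operands.
	__declspec(align(16)) __m128 vec0 = {1.0f, 1.0f, 1.0f, 1.0f}, vec1 = {1.0f, 2.0f, 4.0f, 8.0f};
	__declspec(align(16)) __m128 result;

	__asm{
		movaps xmm0, vec0
		movaps xmm1, vec1
		// Multiply all four pairs, put the sum in the low float of xmm0.
		dpps xmm0, xmm1, 0xF1
		// Copy the result out so it can be inspected.
		movaps result, xmm0
	}
	printf("%f\n", result.m128_f32[0]);	// prints 15.000000
	return 0;
}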
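
For compilers that do not accept the MSVC-style inline assembly, the same operation is available through the _mm_dp_ps intrinsic declared in <smmintrin.h>. The following is a minimal sketch, assuming an x86 compiler with SSE4.1 support; the names dot4 and SSE41_FN are invented for this example, and the target attribute is only there so that GCC or Clang will accept the intrinsic without a command-line switch such as -msse4.1.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_dp_ps */

/* Assumption: on GCC/Clang, enable SSE4.1 for this one function only. */
#if defined(__GNUC__) || defined(__clang__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* Illustrative helper: dot product of two 4-float arrays using DPPS. */
SSE41_FN static float dot4(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);          /* unaligned load, any float[4] works */
    __m128 vb = _mm_loadu_ps(b);
    __m128 dp = _mm_dp_ps(va, vb, 0xF1);  /* all four products, sum in element 0 */
    return _mm_cvtss_f32(dp);             /* extract element 0 */
}
```

Called with the same data as the program above, {1, 1, 1, 1} and {1, 2, 4, 8}, dot4 returns 15. The CPU still has to support SSE4.1 at run time; this sketch does not check for that.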

Gunther · February 22, 2014, 11:07:11 AM

Mark,

we had some long threads about calculating the dot product in the old forum some years ago. Here is one. If you're interested, search inside the old UK forum. Thank you for your code. It seems that you're very active. Go forward.

Gunther

Mark44 · February 22, 2014, 11:31:47 AM

Gunther,
I didn't know about the archives, so thanks - some more stuff to pore through.

I'm active because I now have the time to dawdle over things that interest me (I retired in mid-December). I've managed to keep pretty busy between getting up to speed with the latest and greatest features of the Intel CPU and working on two of my old motorcycles (40's era Harleys).

Gunther · February 22, 2014, 10:18:52 PM

Mark,

Quote from: Mark44 on February 22, 2014, 11:31:47 AM
I'm active because I now have the time to dawdle over things that interest me (I retired in mid-December). I've managed to keep pretty busy between getting up to speed with the latest and greatest features of the Intel CPU and working on two of my old motorcycles (40's era Harleys).

that sounds interesting. A 40's era Harley is a large temptation, isn't it?

Gunther

Mark44 · February 23, 2014, 04:15:01 AM

Quote from: Gunther on February 22, 2014, 10:18:52 PM
Mark,

Quote from: Mark44 on February 22, 2014, 11:31:47 AM
I'm active because I now have the time to dawdle over things that interest me (I retired in mid-December). I've managed to keep pretty busy between getting up to speed with the latest and greatest features of the Intel CPU and working on two of my old motorcycles (40's era Harleys).

that sounds interesting. A 40's era Harley is a large temptation, isn't it?

Gunther

Very much so. The oldest of the bunch is a '46 Harley that's been apart for 3+ years. I'm finally getting it back together after what I initially thought would be a short time. The other old one is a '48 that I'm doing some maintenance on to fix a small oil leak. This is a good time of year to work on them, as winter in Washington state is too cold and/or wet for anything other than rides of a few miles.

Gunther · February 23, 2014, 05:48:08 AM

Mark,

Quote from: Mark44 on February 23, 2014, 04:15:01 AM
This is a good time of year to work on them, as winter in Washington state is too cold and/or wet for anything other than rides of a few miles.

so, you can use the time. That's okay. We had a very mild winter here in Berlin. At the moment we have 10 degrees Celsius above zero. That's very warm for February and not so common. The sun is shining all day, the spring flowers are blooming. It would be great weather for a long ride; unfortunately, I don't have a Harley.

Gunther

The MASM Forum
