Author Topic: Have Intel released a crippled, half-finished instruction set with AVX?  (Read 8866 times)

Ghandi

  • Guest
Hi all,

I foolishly thought I would be able to port some SSE2 code across to AVX to take advantage of doubling the number of 32 bit integers operated on at once, even with the newer 3/4 operand instruction forms taken into account. I'm talking about an i7 2600K by the way; AVX is 'supported' by it...

After banging my head in frustration, wondering why I couldn't assemble the 256 bit counterpart to 'pslld' (which is 'vpslld' with a YMM destination, by the way), I read a little about it in the 'Intel® Architecture Instruction Set Extensions Programming Reference' (319433-013B).

Although there are instructions which operate on both 128 bit 'lanes' of the AVX registers, integer operations like the bitwise shifts are restricted to the 128 bit XMM registers until CPUs supporting AVX2 are released.

What gives? Did Intel just want to troll people with this new AVX label but deliver half a technology, or do they expect the people who bought their half-baked CPUs to write a workaround (read: unnecessary hack) to use all 8 of the 32 bit integers in SIMD operations? Something like casting down to a 128 bit register, operating on those dwords, then casting back and rejoining with the lower 4 dwords before continuing on in 256 bit land; or even having to store 128 bit results side by side in memory and reload them into a YMM register.

I'm probably putting my foot in my mouth here, and there are other instructions which can move the upper half down into a 128 bit register to allow this, but I still feel they should have given the integer/logical instructions a bit more thought, and even considered a bitwise rotation instruction like AMD have done with their XOP extension to AVX.

I have since found this on SourceForge, so it seems that operating on 128 bits at a time really is required for certain operations under AVX. I guess after getting used to it and learning the assembler side of it, things might not be so bad after all...

Code:
// shift in zeros
friend inline int_vec slli(int_vec const & arg, int count)
{
    __m256 arg_data = _mm256_castsi256_ps(arg.data_);
    __m128 arg_low  = _mm256_castps256_ps128(arg_data);    // lower 4 dwords
    __m128 arg_hi   = _mm256_extractf128_ps(arg_data, 1);  // upper 4 dwords

    // do the integer shift one 128 bit half at a time
    __m128 newlow = _mm_castsi128_ps(_mm_slli_epi32(_mm_castps_si128(arg_low), count));
    __m128 newhi  = _mm_castsi128_ps(_mm_slli_epi32(_mm_castps_si128(arg_hi),  count));

    // rejoin the two halves into one 256 bit value
    __m256 result = _mm256_castps128_ps256(newlow);
    result = _mm256_insertf128_ps(result, newhi, 1);
    return _mm256_castps_si256(result);
}

// shift in zeros
friend inline int_vec srli(int_vec const & arg, int count)
{
    __m256 arg_data = _mm256_castsi256_ps(arg.data_);
    __m128 arg_low  = _mm256_castps256_ps128(arg_data);
    __m128 arg_hi   = _mm256_extractf128_ps(arg_data, 1);

    __m128 newlow = _mm_castsi128_ps(_mm_srli_epi32(_mm_castps_si128(arg_low), count));
    __m128 newhi  = _mm_castsi128_ps(_mm_srli_epi32(_mm_castps_si128(arg_hi),  count));

    __m256 result = _mm256_castps128_ps256(newlow);
    result = _mm256_insertf128_ps(result, newhi, 1);
    return _mm256_castps_si256(result);
}
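
From what I've read (untested on my end, since the 2600K stops at AVX), AVX2 is supposed to add the 256 bit integer forms, so the whole extract/shift/rejoin dance above should collapse into a single instruction. Something like this, assuming an AVX2 capable compiler and CPU:

Code:
#include <immintrin.h>

// AVX2 only - shown for comparison, won't run on Sandy Bridge
inline __m256i slli_avx2(__m256i v, int count)
{
    return _mm256_slli_epi32(v, count); // shifts all 8 dwords at once (vpslld with a YMM destination)
}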

What are other people's thoughts on this?

HR,
Ghandi

qWord

  • Member
  • *****
  • Posts: 1475
  • The base type of a type is the type itself
    • SmplMath macros
Hi,

I'm also not happy with this weak support for bit-manipulating instructions, but I can understand why they do it: additional instructions (or extending existing ones to 256 bits) require more logic on the die and increase power consumption. The question then is how many applications actually need or use such instructions...
MREAL macros - when you need floating point arithmetic while assembling!

Ghandi

  • Guest
That is a good reply and one I honestly hadn't considered in my little 'awwwww' moment. What about leaving out the HD 3000 integrated GPU, which I don't ever intend to use, and spending that die real estate on CPU power instead, though?

This is the reason I bought a CPU to begin with, after all, and I already use my poor GTX-550i for GPGPU number crunching when my sloppy CUDA/C coding doesn't cause a GP fault, crashing either the display driver or Windows 7 itself. If I start using my CPU as a GPU, my rig will be a right mess. ;)

I think once I get my head around the fact that the AVX instructions now read more like an HLL API (specify the source(s) and destination plus extra parameters for the operation), it won't seem so bad. To be honest, I have never studied the SIMD instruction sets in any great depth, because the bitwise/logical support was sufficient for my needs, which were SIMD MD5 and the like.

Because I haven't explored these instructions, I don't even know their full potential. I often find myself saving XMM results to memory and then using GP registers to load the values back in smaller pieces for use or comparison. With the MD5 code, for instance, the resulting hashes sit in 4 x 128 bit vectors that need to be swapped from vertical to horizontal; otherwise the comparison code has to deal with the fact that each 128 bit register only holds the Nth dword of each hash.

I'm sure there are shuffle instructions or similar which would speed this up, but being the caveman that I am, I revert back to the x86/x64 GP registers.
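
Just to convince myself it can be done, here is a rough sketch of the shuffle approach (made-up helper name, nothing from my actual MD5 code): four unpacks turn 'one hash word per register' into 'one complete hash per register', if I've traced the lanes correctly.

Code:
#include <emmintrin.h> // SSE2

// r0..r3 come in as {A0,A1,A2,A3}, {B0,B1,B2,B3}, {C0,C1,C2,C3}, {D0,D1,D2,D3}
// (the same MD5 word for four messages) and go out as {A0,B0,C0,D0}, {A1,B1,C1,D1},
// {A2,B2,C2,D2}, {A3,B3,C3,D3} (one complete hash per register).
inline void transpose_4x32(__m128i &r0, __m128i &r1, __m128i &r2, __m128i &r3)
{
    __m128i t0 = _mm_unpacklo_epi32(r0, r1); // A0 B0 A1 B1
    __m128i t1 = _mm_unpacklo_epi32(r2, r3); // C0 D0 C1 D1
    __m128i t2 = _mm_unpackhi_epi32(r0, r1); // A2 B2 A3 B3
    __m128i t3 = _mm_unpackhi_epi32(r2, r3); // C2 D2 C3 D3

    r0 = _mm_unpacklo_epi64(t0, t1); // A0 B0 C0 D0
    r1 = _mm_unpackhi_epi64(t0, t1); // A1 B1 C1 D1
    r2 = _mm_unpacklo_epi64(t2, t3); // A2 B2 C2 D2
    r3 = _mm_unpackhi_epi64(t2, t3); // A3 B3 C3 D3
}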

I did wish, though, that there was a rotate instruction, instead of having to perform the C-ism of copying the source to another register, shifting each copy the appropriate number of bits in opposite directions and then ORing them back together to make the rotated value. It seems a real miss for Intel not to have SIMD rotation instructions.
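
The C-ism I mean looks roughly like this (a sketch with a made-up name, assuming a shift count between 1 and 31):

Code:
#include <emmintrin.h> // SSE2

// rotate each of the 4 dwords left by n bits, emulated with two shifts and an OR
inline __m128i rotl_epi32(__m128i v, int n)
{
    return _mm_or_si128(_mm_slli_epi32(v, n),
                        _mm_srli_epi32(v, 32 - n));
}

// AMD's XOP vprotd does this in one instruction, which is what I was wishing Intel had.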

AMD have addressed this with their Bulldozer processor, from what I've read, and they have also adopted support for Intel's AVX instruction set. Although I read a lot of rubbish online from time to time, if what I've read about AMD vs nVidia for sheer GPGPU grunt is accurate, they are going strong on that front as well.

Kinda makes me wonder if I should have bought an AMD box instead of my Intel/nVidia setup. :O

HR,
Ghandi
« Last Edit: July 19, 2012, 01:02:23 PM by Ghandi »

Gunther

  • Member
  • *****
  • Posts: 4115
  • Forgive your enemies, but never forget their names
Kinda makes me wonder if I should have bought an AMD box instead of my Intel/nVidia setup. :O

Is the Bulldozer AVX aware?

Gunther
You have to know the facts before you can distort them.

Ghandi

  • Guest
Yes, although if I read correctly, the AMD splits a 256 bit AVX/XOP operation into two 128 bit halves, so it takes 2 clocks to perform a single instruction where the Intel takes one.

HR,
Ghandi

Greenhorn

  • Member
  • ***
  • Posts: 434
Hi,

Maybe this blog post from Agner helps a little to understand how AMD is doing this:
Test results for AMD Bulldozer processor

I hope Piledriver, which comes out in autumn, will fix it all. ;)


Greenhorn
Kole Feut un Nordenwind gift en krusen Büdel un en lütten Pint.

Gunther

  • Member
  • *****
  • Posts: 4115
  • Forgive your enemies, but never forget their names
Hi,

Maybe this blog post from Agner helps a little to understand how AMD is doing this:
Test results for AMD Bulldozer processor

I hope Piledriver, which comes out in autumn, will fix it all. ;)

Greenhorn

Interesting blog post by Agner Fog. Worth reading. Thank you, Greenhorn.

Gunther
You have to know the facts before you can distort them.