Hi all,
I foolishly thought that I would be able to port some SSE2 code across to AVX to take advantage of the doubling of 32-bit integers operated on at once, even with the newer 3/4-operand instructions taken into account. This is on an i7 2600K by the way; AVX is 'supported' by it...
After banging my head in frustration wondering why I couldn't assemble the 256-bit counterpart to the 'pslld' instruction (which is 'vpslld' with a YMM destination, by the way), I read a little about it in the 'Intel® Architecture Instruction Set Extensions Programming Reference' (319433-013B).
Although there are instructions which operate on both 128-bit 'lanes' of the AVX registers, operations like bitwise shifts are restricted to the 128-bit XMM registers until CPUs supporting AVX2 are released.
What gives? Did Intel just want to troll people with this new AVX label but deliver half a technology, or do they expect the people who have bought their half-baked CPUs to make a workaround (read: unnecessary hack) to use all 8 32-bit integers in SIMD operations? Something like extracting to a 128-bit register, operating on those dwords, then rejoining with the lower 4 dwords before continuing on in 256-bit land; or even having to store 128-bit results side by side in memory and then reload them into a YMM register.
I'm probably putting my foot in my mouth here and there are other instructions which can populate the lower lane from the higher one, but I still feel that they should have given the logical instructions a bit of thought, and even considered a bitwise rotate instruction like AMD have done with their XOP variant of AVX.
I have since found this on SourceForge, so it seems that operating 128 bits at a time is simply required for certain operations under AVX. I guess after getting used to it and learning the assembler for it, things might not be so bad after all...
// shift in zeros
friend inline int_vec slli(int_vec const & arg, int count)
{
    // split the 256-bit register into its two 128-bit lanes
    __m256 arg_data = _mm256_castsi256_ps(arg.data_);
    __m128 arg_low = _mm256_castps256_ps128(arg_data);
    __m128 arg_hi = _mm256_extractf128_ps(arg_data, 1);

    // shift each lane with the SSE2 integer instruction, going through the
    // cast intrinsics rather than non-portable C-style vector casts
    __m128 newlow = _mm_castsi128_ps(_mm_slli_epi32(_mm_castps_si128(arg_low), count));
    __m128 newhi = _mm_castsi128_ps(_mm_slli_epi32(_mm_castps_si128(arg_hi), count));

    // rejoin the lanes into a 256-bit result
    __m256 result = _mm256_castps128_ps256(newlow);
    result = _mm256_insertf128_ps(result, newhi, 1);
    return _mm256_castps_si256(result);
}

// shift in zeros
friend inline int_vec srli(int_vec const & arg, int count)
{
    __m256 arg_data = _mm256_castsi256_ps(arg.data_);
    __m128 arg_low = _mm256_castps256_ps128(arg_data);
    __m128 arg_hi = _mm256_extractf128_ps(arg_data, 1);

    __m128 newlow = _mm_castsi128_ps(_mm_srli_epi32(_mm_castps_si128(arg_low), count));
    __m128 newhi = _mm_castsi128_ps(_mm_srli_epi32(_mm_castps_si128(arg_hi), count));

    __m256 result = _mm256_castps128_ps256(newlow);
    result = _mm256_insertf128_ps(result, newhi, 1);
    return _mm256_castps_si256(result);
}
What are other people's thoughts on this?
HR,
Ghandi