Next step: array reverse.
Attached the full program, with a check for the results.
Any idea how to improve the routine?
Frank
My test of the routine, reversing a 4096-byte array:
-------------------------------------------------
Intel(R) Core(TM)2 CPU E6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
-------------------------------------------------
1.193 cycles for Reverse Array with PSHUFB
1.193 cycles for Reverse Array with PSHUFB
Any smarter code, or improvements?
Note: PSHUFB needs a CPU with SSSE3 support,
i.e. an Intel Core 2 Duo or newer.
Frank
Comparing xmm/pshufb with mov/bswap for reversing data:
Quote
------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------
1.194 cycles for Reverse Array with PSHUFB
3.094 cycles for Reverse Array with MOV/BSWAP
1.193 cycles for Reverse Array with PSHUFB
3.096 cycles for Reverse Array with MOV/BSWAP
It looks much faster.
Could you also upload an EXE?
BTW: a version using 4 GPRs for BSWAP (16 bytes per iteration) would be interesting.
Quote from: qWord on December 05, 2012, 10:00:43 AM
Could you also upload an EXE?
BTW: a version using 4 GPRs for BSWAP (16 bytes per iteration) would be interesting.
Yes for the exe: here it is.
I doubt it could be faster, considering that:
ecx is used as the loop counter
ebx and eax are used for addressing the top and bottom of the
array [it's an in-place reverse]
4 more GPRs are quite a lot to free. If you know how, show me.
The attached file is the exe; rename the .zip to .exe to use it.
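For readers without the attachment, here is a rough C model of the in-place idea described above: one counter and two addresses, one walking up from the bottom of the array and one walking down from the top, with a byte-swapped dword exchanged at each step. It is only a sketch of the logic, not the attached routine; the names bswap32 and reverse_inplace_bswap are made up for illustration, and the length is assumed to be a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the x86 BSWAP instruction. */
static uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reverse buf in place. Assumption: len is a multiple of 8. */
static void reverse_inplace_bswap(uint8_t *buf, size_t len) {
    if (len < 8) return;
    uint8_t *lo = buf;            /* bottom address register */
    uint8_t *hi = buf + len - 4;  /* top address register    */
    size_t pairs = len / 8;       /* loop counter            */
    while (pairs--) {
        uint32_t a, b;
        memcpy(&a, lo, 4);        /* load one dword from each end */
        memcpy(&b, hi, 4);
        a = bswap32(a);           /* reverse the bytes inside each dword */
        b = bswap32(b);
        memcpy(lo, &b, 4);        /* store them at the opposite ends */
        memcpy(hi, &a, 4);
        lo += 4;
        hi -= 4;
    }
}
```

Only two dwords are in flight per iteration, which is why the in-place form needs so few registers.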
Quote from: frktons on December 05, 2012, 10:09:26 AM
4 more GPR are quite a lot to free. If you know how, show me.
something like this:
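    lea esi, [MyArray + sizeof MyArray - 16]   ; esi -> last 16-byte block of the source
    lea edi, MyDestArray                       ; edi -> start of the destination
    align 8
@@:
    mov eax, [esi+0*4]                         ; load 4 dwords (16 bytes)
    mov ebx, [esi+1*4]
    mov ecx, [esi+2*4]
    mov edx, [esi+3*4]
    bswap edx                                  ; reverse the bytes inside each dword
    bswap ecx
    bswap ebx
    bswap eax
    mov [edi+0*4],edx                          ; store the dwords in reverse order
    mov [edi+1*4],ecx
    mov [edi+2*4],ebx
    mov [edi+3*4],eax
    lea esi,[esi-16]                           ; source walks backwards
    lea edi,[edi+16]                           ; destination walks forwards
    cmp esi,OFFSET MyArray
    jae @B                                     ; loop while esi is still inside MyArray
@@: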
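To double-check the index bookkeeping of the loop above, here is a small C model of the same idea: 16 bytes per iteration, four dwords loaded, each byte-swapped, then stored in reverse order into a separate destination. It is a sketch only; bswap32 and reverse_copy_4x32 are names invented for this example, and the length is assumed to be a multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the x86 BSWAP instruction. */
static uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Copy src reversed into dst. Assumptions: len is a multiple of 16 and
   the two buffers do not overlap. */
static void reverse_copy_4x32(uint8_t *dst, const uint8_t *src, size_t len) {
    for (size_t off = 0; off + 16 <= len; off += 16) {
        const uint8_t *s = src + len - 16 - off;  /* like esi, walking down */
        uint8_t *d = dst + off;                   /* like edi, walking up   */
        uint32_t w[4];
        memcpy(w, s, 16);                         /* four dword loads */
        for (int i = 0; i < 4; i++)
            w[i] = bswap32(w[i]);                 /* four BSWAPs */
        uint32_t r[4] = { w[3], w[2], w[1], w[0] };
        memcpy(d, r, 16);                         /* four stores, reversed order */
    }
}
```

The four loads and four swaps are independent of each other, which presumably is what lets the CPU overlap them better than in the two-register loop.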
qWord, with 4 GPRs I had to change the logic: 2 Arrays instead of 1
but it is quite fast:
Quote
---------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
---------------------------------------------------------
1.198 cycles for Reverse Array with PSHUFB
3.098 cycles for Reverse Array with MOV/BSWAP
1.858 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
794 cycles for Reverse Array with PSHUFB
2.064 cycles for Reverse Array with MOV/BSWAP
1.680 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
Attached is the new package with source and exe.
Quote from: frktons on December 05, 2012, 10:54:16 AM
qWord, with 4 GPRs I had to change the logic: 2 Arrays instead of 1
like the SSSE3 version.
Quote
---------------------------------------------------------
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
784 cycles for Reverse Array with PSHUFB
1.696 cycles for Reverse Array with MOV/BSWAP
1.097 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
789 cycles for Reverse Array with PSHUFB
1.696 cycles for Reverse Array with MOV/BSWAP
1.106 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
Unrolling the loops gives a boost, and using more data
at the same time gives another boost.
The SSSE3 version takes advantage of both.
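For anyone who wants to see what the PSHUFB approach does without SSSE3 hardware, here is a scalar C model. pshufb_model mimics the documented behaviour of PSHUFB on one 16-byte register (a result byte is zero when bit 7 of the mask byte is set, otherwise it is the source byte selected by the low 4 bits), and the mask 15..0 reverses a block. The loop handles four 16-byte blocks per iteration as one possible reading of "4 xmm registers"; that layout, the function names and the multiple-of-64 length are assumptions of this sketch, not the attached code.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of PSHUFB on a 16-byte register. */
static void pshufb_model(uint8_t out[16], const uint8_t src[16],
                         const uint8_t mask[16]) {
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        out[i] = tmp[i];
}

/* Shuffle mask that reverses the 16 bytes of a block. */
static const uint8_t REVERSE_MASK[16] = {
    15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
};

/* Copy src reversed into dst, 64 bytes (four blocks) per iteration.
   Assumptions: len is a multiple of 64, buffers do not overlap. */
static void reverse_copy_pshufb4(uint8_t *dst, const uint8_t *src, size_t len) {
    for (size_t off = 0; off + 64 <= len; off += 64) {
        const uint8_t *s = src + len - 64 - off;  /* source walks down */
        uint8_t *d = dst + off;                   /* destination walks up */
        pshufb_model(d +  0, s + 48, REVERSE_MASK);
        pshufb_model(d + 16, s + 32, REVERSE_MASK);
        pshufb_model(d + 32, s + 16, REVERSE_MASK);
        pshufb_model(d + 48, s +  0, REVERSE_MASK);
    }
}
```

In the real routine each pshufb_model call would be a single instruction on an xmm register, so one iteration moves 64 bytes with four shuffles instead of sixteen BSWAPs.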
And here it is, probably the fastest combination using SSSE3
with 4 xmm registers:
Quote
---------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
---------------------------------------------------------
794 cycles for Reverse Array with PSHUFB
2.071 cycles for Reverse Array with MOV/BSWAP
1.642 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
594 cycles for Reverse Array with PSHUFB using 4 xmm
---------------------------------------------------------
794 cycles for Reverse Array with PSHUFB
2.068 cycles for Reverse Array with MOV/BSWAP
1.643 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
594 cycles for Reverse Array with PSHUFB using 4 xmm
---------------------------------------------------------
Just FYI, this test and the last one don't work on my AMD (Phenom II X6 1100T).
C000001D - illegal instruction
Quote from: sinsi on December 05, 2012, 07:59:52 PM
Just FYI, this test and the last one don't work on my AMD (Phenom II X6 1100T).
C000001D - illegal instruction
Is your CPU able to execute SSSE3 instructions like
PSHUFB?
I don't know how much AMD and Intel processors differ,
but I work on Intel, and the tests run there.
If you can point to the instruction that raises the error, we can find out what
to do about it.
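C000001D is the Windows status code for an illegal instruction, which is what executing PSHUFB on a CPU without SSSE3 would produce. The usual way to guard against it is to test the SSSE3 feature flag before taking the PSHUFB path: CPUID leaf 1 reports it in bit 9 of ECX. The sketch below only decodes that bit from a value the caller has already obtained from CPUID; has_ssse3 is a hypothetical helper name and the sample values are made up.

```c
#include <stdint.h>

/* CPUID leaf 1 (EAX = 1) reports SSE3 in ECX bit 0 and SSSE3 in ECX bit 9. */
#define CPUID1_ECX_SSE3  (1u << 0)
#define CPUID1_ECX_SSSE3 (1u << 9)

/* Returns 1 when the given CPUID.1:ECX value advertises SSSE3, else 0.
   The caller is assumed to have executed CPUID with EAX = 1 already. */
static int has_ssse3(uint32_t cpuid1_ecx) {
    return (cpuid1_ecx & CPUID1_ECX_SSSE3) != 0;
}
```

With made-up inputs, has_ssse3(0x00000201u) returns 1 (SSE3 and SSSE3 both set) and has_ssse3(0x00000001u) returns 0 (SSE3 only). A test program could fall back to the MOV/BSWAP routines in the second case instead of crashing.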
Frank,
Put these types of topics in the Lab, that way they won't get lost among other postings on simpler questions.
Quote from: hutch-- on December 05, 2012, 10:47:42 PM
Frank,
Put these types of topics in the Lab, that way they won't get lost among other postings on simpler questions.
OK Steve. They started as simple questions, but became more intriguing
as I got my hands dirty. I should have foreseen that.
I tried to unroll the SSSE3/PSHUFB version with 4 xmm one step further,
but I got no relevant gain. Maybe using all 8 xmm registers at the same time could give
a small boost, but I'm not sure. Should I try or shouldn't I?
We'll see.
I think I have reached the limit of the SSSE3 code:
Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
1.191 cycles for Reverse Array with PSHUFB
2.064 cycles for Reverse Array with MOV/BSWAP
1.617 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
565 cycles for Reverse Array with PSHUFB using 4 xmm unrolled 4
594 cycles for Reverse Array with PSHUFB using 4 xmm
------------------------------------------------------------------------
794 cycles for Reverse Array with PSHUFB
2.064 cycles for Reverse Array with MOV/BSWAP
1.639 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
565 cycles for Reverse Array with PSHUFB using 4 xmm unrolled 4
596 cycles for Reverse Array with PSHUFB using 4 xmm
------------------------------------------------------------------------
565 CPU cycles seems to be the lowest reachable value on my system.
While the other routines show different timings on every run, the version using 4 xmm
unrolled 4 times tends to always give the same value.