I think I've reached the limit of what the SSSE3 code can do:
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
1.191 cycles for Reverse Array with PSHUFB
2.064 cycles for Reverse Array with MOV/BSWAP
1.617 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
565 cycles for Reverse Array with PSHUFB using 4 xmm unrolled 4
594 cycles for Reverse Array with PSHUFB using 4 xmm
------------------------------------------------------------------------
794 cycles for Reverse Array with PSHUFB
2.064 cycles for Reverse Array with MOV/BSWAP
1.639 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
565 cycles for Reverse Array with PSHUFB using 4 xmm unrolled 4
596 cycles for Reverse Array with PSHUFB using 4 xmm
------------------------------------------------------------------------
565 CPU cycles seems to be the lowest value reachable on my system.
While the other routines show different timings on every run, the
version using 4 xmm registers unrolled 4 times tends to give the
same value every time.
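
For anyone who wants to see the idea behind the MOV/BSWAP variant in C
terms, here is a rough sketch. It is not the code that was timed above.
It assumes "Reverse Array" means reversing the byte order of a buffer
whose length is a multiple of 4, and the names swap32 and
reverse_bytes_bswap are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable byte swap of a 32-bit value. Compilers for x86 usually
   recognise this pattern and emit a single BSWAP instruction. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reverse len bytes from src into dst, 4 bytes per step.
   Assumptions (hypothetical, not from the timed code): dst and src do
   not overlap, and len is a multiple of 4. */
static void reverse_bytes_bswap(uint8_t *dst, const uint8_t *src, size_t len)
{
    assert(len % 4 == 0);
    for (size_t i = 0; i < len; i += 4) {
        uint32_t v;
        memcpy(&v, src + len - 4 - i, 4);  /* load a dword from the tail */
        v = swap32(v);                     /* reverse its 4 bytes */
        memcpy(dst + i, &v, 4);            /* store it at the head */
    }
}
```

The PSHUFB versions follow the same pattern with 16 bytes per step: load
an xmm register from the tail, shuffle it with a byte-reversing mask, and
store it at the head.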