Title: Array Reverse with SSE
Post by: frktons on December 04, 2012, 10:15:28 AM

Next step: array reverse.

Attached the full program, with a check for the results.

Any idea how to improve the routine?

Frank

Title: Re: Array Reverse with SSE
Post by: frktons on December 04, 2012, 10:51:56 PM

My test of the routine, reversing a 4096-byte array:

-------------------------------------------------
Intel(R) Core(TM)2 CPU  E6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
-------------------------------------------------
1.193   cycles for Reverse Array with PSHUFB
1.193   cycles for Reverse Array with PSHUFB

Any smarter code, or improvement?

Note: PSHUFB needs a CPU with SSSE3 support,
i.e. an Intel Core 2 Duo or newer.

Frank

Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 09:43:21 AM

Comparing xmm/pshufb with mov/bswap for reversing data:

Quote
------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------
1.194 cycles for Reverse Array with PSHUFB

3.094 cycles for Reverse Array with MOV/BSWAP

1.193 cycles for Reverse Array with PSHUFB

3.096 cycles for Reverse Array with MOV/BSWAP

PSHUFB looks much faster: roughly 2.6 times on this CPU.

Title: Re: Array Reverse with SSE
Post by: qWord on December 05, 2012, 10:00:43 AM

Could you also upload an EXE?
BTW: a version using 4 GPRs for BSWAP (16 bytes per iteration) would be interesting.

Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 10:09:26 AM

Quote from: qWord on December 05, 2012, 10:00:43 AM
Could you also upload an EXE?
BTW: a version using 4 GPRs for BSWAP (16 bytes per iteration) would be interesting.

As for the exe: here it is.
I doubt it could be faster, considering that:

ecx is used as the loop counter
ebx and eax are used for addressing the top and bottom of the
array [it's an in-place reverse]
4 more GPRs are quite a lot to free. If you know how, show me.
The attached file is the exe; rename .zip to .exe to use it.
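
For readers without the attachment, here is a minimal C sketch of the in-place idea described above: two cursors start at the bottom and top of the array, each dword is byte-swapped, and the two dwords trade places. It is not the attached routine. The names `bswap32` and `reverse_in_place` are made up for illustration, and the length is assumed to be a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the BSWAP instruction: reverse the 4 bytes of a dword. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Reverse buf in place, one dword from each end per iteration.
   Assumption: len is a multiple of 4 (hypothetical helper, not the forum code). */
static void reverse_in_place(unsigned char *buf, size_t len)
{
    size_t lo = 0;      /* offset of the bottom dword (like one address register) */
    size_t hi = len;    /* one past the top dword (like the other one) */
    uint32_t a, b;

    while (lo < hi) {
        hi -= 4;
        memcpy(&a, buf + lo, 4);    /* load bottom dword */
        memcpy(&b, buf + hi, 4);    /* load top dword */
        a = bswap32(a);
        b = bswap32(b);
        memcpy(buf + lo, &b, 4);    /* swapped top goes to the bottom */
        memcpy(buf + hi, &a, 4);    /* swapped bottom goes to the top */
        lo += 4;
    }
}
```

Calling `reverse_in_place(buf, 4096)` on a 4096-byte buffer moves 8 bytes per iteration, which is why freeing more registers to widen each step is the obvious next experiment.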

Title: Re: Array Reverse with SSE
Post by: qWord on December 05, 2012, 10:40:41 AM

Quote from: frktons on December 05, 2012, 10:09:26 AM
4 more GPRs are quite a lot to free. If you know how, show me.

Something like this:

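    lea  esi, [MyArray + sizeof MyArray - 16]   ; esi -> last 16-byte block of the source
    lea  edi, MyDestArray                       ; edi -> start of the destination
    align 8
@@:
    mov  eax, [esi+0*4]         ; load four dwords from the current source block
    mov  ebx, [esi+1*4]
    mov  ecx, [esi+2*4]
    mov  edx, [esi+3*4]
    bswap edx                   ; reverse the bytes inside each dword
    bswap ecx
    bswap ebx
    bswap eax
    mov  [edi+0*4], edx         ; store the dwords in the opposite order
    mov  [edi+1*4], ecx
    mov  [edi+2*4], ebx
    mov  [edi+3*4], eax
    lea  esi, [esi-16]          ; source walks backwards
    lea  edi, [edi+16]          ; destination walks forwards
    cmp  esi, OFFSET MyArray
    jae  @B                     ; loop until esi drops below the start of the source
@@: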
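
To check the logic of the loop above without assembling it, the same steps can be modelled in portable C. This is only a sketch: `byteswap_dword` stands in for BSWAP, `reverse_copy_4gpr` is a hypothetical name, and the length is assumed to be a multiple of 16 with non-overlapping source and destination.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for BSWAP: reverse the 4 bytes of a dword. */
static uint32_t byteswap_dword(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Copy src to dst in reverse byte order, 16 bytes per iteration.
   Assumptions: len is a multiple of 16; src and dst do not overlap. */
static void reverse_copy_4gpr(unsigned char *dst, const unsigned char *src, size_t len)
{
    size_t s = len;         /* source offset, walks backwards like esi */
    size_t d = 0;           /* destination offset, walks forwards like edi */
    uint32_t r[4], out[4];  /* r[] plays the role of eax, ebx, ecx, edx */

    while (s >= 16) {
        s -= 16;
        memcpy(r, src + s, 16);             /* four dword loads */
        out[0] = byteswap_dword(r[3]);      /* edx goes first */
        out[1] = byteswap_dword(r[2]);
        out[2] = byteswap_dword(r[1]);
        out[3] = byteswap_dword(r[0]);      /* eax goes last */
        memcpy(dst + d, out, 16);           /* four dword stores */
        d += 16;
    }
}
```

Swapping the bytes inside each dword and then storing the four dwords in the opposite order is what reverses the whole 16-byte block; walking the source backwards and the destination forwards reverses the order of the blocks.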

Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 10:54:16 AM

qWord, with 4 GPRs I had to change the logic: 2 Arrays instead of 1
but it is quite fast:

Quote
---------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
---------------------------------------------------------
1.198 cycles for Reverse Array with PSHUFB
3.098 cycles for Reverse Array with MOV/BSWAP
1.858 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
794 cycles for Reverse Array with PSHUFB
2.064 cycles for Reverse Array with MOV/BSWAP
1.680 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------

Attached is the new package with source and exe.

Title: Re: Array Reverse with SSE
Post by: qWord on December 05, 2012, 11:04:20 AM

Quote from: frktons on December 05, 2012, 10:54:16 AM
qWord, with 4 GPRs I had to change the logic: 2 Arrays instead of 1

like the SSSE3 version.

Quote
---------------------------------------------------------
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
784 cycles for Reverse Array with PSHUFB
1.696 cycles for Reverse Array with MOV/BSWAP
1.097 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
789 cycles for Reverse Array with PSHUFB
1.696 cycles for Reverse Array with MOV/BSWAP
1.106 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------

Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 11:15:28 AM

Unrolling the loops gives a boost, and processing more data per
instruction gives another.
The SSSE3 version takes advantage of both.
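
The SSSE3 source is only in the attachments, so the following is a scalar C model of what a PSHUFB-based reverse does, not the actual routine. PSHUFB picks each result byte from the source register using the low 4 bits of the matching mask byte, or writes 0 if bit 7 of the mask byte is set. With the mask 15, 14, ..., 0 one instruction reverses 16 bytes. The names `pshufb_model`, `REVERSE_MASK` and `reverse_copy_pshufb_model` are illustrative assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Scalar emulation of PSHUFB on one 16-byte register:
   result[i] = 0 if bit 7 of mask[i] is set, else src[mask[i] & 15]. */
static void pshufb_model(unsigned char dst[16], const unsigned char src[16],
                         const unsigned char mask[16])
{
    unsigned char tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    memcpy(dst, tmp, 16);   /* tmp allows dst and src to be the same buffer */
}

/* Mask that reverses the 16 bytes of a register (assumed, not taken from the attachment). */
static const unsigned char REVERSE_MASK[16] =
    { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };

/* Reverse-copy src into dst, one modelled PSHUFB per 16-byte block.
   Assumptions: len is a multiple of 16; src and dst do not overlap. */
static void reverse_copy_pshufb_model(unsigned char *dst, const unsigned char *src,
                                      size_t len)
{
    size_t s = len, d = 0;
    while (s >= 16) {
        s -= 16;
        pshufb_model(dst + d, src + s, REVERSE_MASK);
        d += 16;
    }
}
```

One shuffle here replaces the four BSWAPs plus the reordered stores of the GPR version, which fits the "more data per instruction" point above.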

Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 07:36:33 PM

And here it is, probably the fastest combination using SSSE3
with 4 xmm registers:

Quote
---------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
---------------------------------------------------------
794 cycles for Reverse Array with PSHUFB
2.071 cycles for Reverse Array with MOV/BSWAP
1.642 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
594 cycles for Reverse Array with PSHUFB using 4 xmm
---------------------------------------------------------
794 cycles for Reverse Array with PSHUFB
2.068 cycles for Reverse Array with MOV/BSWAP
1.643 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
594 cycles for Reverse Array with PSHUFB using 4 xmm
---------------------------------------------------------

Title: Re: Array Reverse with SSE
Post by: sinsi on December 05, 2012, 07:59:52 PM

Just FYI, this test and the last one don't work on my AMD (Phenom II X6 1100T).
C000001D - illegal instruction

Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 08:18:46 PM

Quote from: sinsi on December 05, 2012, 07:59:52 PM
Just FYI, this test and the last one don't work on my AMD (Phenom II X6 1100T).
C000001D - illegal instruction

Is your CPU able to use SSSE3 instructions like PSHUFB?

I don't know how many differences there are between AMD and Intel
processors, but I work on Intel, and the tests run there.

If you point out the instruction that raises the error, we can find out what
to do about it.
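
C000001D is the Windows illegal-instruction status, which is what a PSHUFB raises on a CPU without SSSE3. As far as I know the Phenom II generation reports SSE3 but not SSSE3, which would explain the crash. A test program can guard against this by checking the SSSE3 flag before running the PSHUFB timings: CPUID leaf 1 returns it in bit 9 of ECX (SSE3 is bit 0). The sketch below only decodes the bit. How ECX is obtained (inline CPUID, compiler intrinsic) is left out on purpose, and the names are hypothetical.

```c
#include <stdint.h>

/* Feature bits returned in ECX by CPUID leaf 1. */
#define CPUID1_ECX_SSE3   (1u << 0)
#define CPUID1_ECX_SSSE3  (1u << 9)

/* Return 1 if the ECX value from CPUID leaf 1 reports SSSE3 (needed for PSHUFB).
   The caller is assumed to have executed CPUID with EAX = 1 beforehand. */
static int ecx_has_ssse3(uint32_t cpuid1_ecx)
{
    return (cpuid1_ecx & CPUID1_ECX_SSSE3) != 0;
}
```

If the check fails, the program can skip the PSHUFB tests and still run the MOV/BSWAP ones instead of crashing.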

Title: Re: Array Reverse with SSE
Post by: hutch-- on December 05, 2012, 10:47:42 PM

Frank,

Put these types of topics in the Lab, that way it won't get lost with other postings on simpler questions.

Title: Re: Array Reverse with SSE
Post by: frktons on December 06, 2012, 01:39:26 AM

Quote from: hutch-- on December 05, 2012, 10:47:42 PM
Frank,

Put these types of topics in the Lab, that way it won't get lost with other postings on simpler questions.

OK Steve. They started as simple questions, but became more intriguing
as I got my hands dirty. I should have foreseen that.

Title: Re: Array Reverse with SSE
Post by: frktons on December 06, 2012, 11:19:17 AM

I tried to unroll the SSSE3/PSHUFB routine with 4 xmm one step further,
but I got no relevant gain. Maybe using all 8 xmm registers at the same time could give
a small boost, but I'm not sure. Should I try or shouldn't I?

We'll see.

Title: Re: Array Reverse with SSE
Post by: frktons on December 07, 2012, 12:58:55 AM

I think I have reached the limit of the SSSE3 code:

Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
1.191 cycles for Reverse Array with PSHUFB
2.064 cycles for Reverse Array with MOV/BSWAP
1.617 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
565 cycles for Reverse Array with PSHUFB using 4 xmm unrolled 4
594 cycles for Reverse Array with PSHUFB using 4 xmm
------------------------------------------------------------------------
794 cycles for Reverse Array with PSHUFB
2.064 cycles for Reverse Array with MOV/BSWAP
1.639 cycles for Reverse Array with MOV/BSWAP with 4 GPRs
565 cycles for Reverse Array with PSHUFB using 4 xmm unrolled 4
596 cycles for Reverse Array with PSHUFB using 4 xmm
------------------------------------------------------------------------

565 CPU cycles seems to be the lowest reachable value on my system.
While the other routines show different timings on every run, the version using
4 xmm registers unrolled 4 times tends to give the same value every time.
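
To put the cycle counts in perspective they can be turned into throughput. In the listings the dot is a thousands separator, so "2.064" means 2064 cycles. The sketch assumes the array is still the 4096 bytes mentioned in the first test; `bytes_per_cycle` is just an illustrative helper.

```c
/* Throughput in bytes per cycle for one timing line.
   Assumption: array_bytes is 4096, the size given for the first test. */
static double bytes_per_cycle(double array_bytes, double cycles)
{
    return array_bytes / cycles;
}
```

With these assumptions, 565 cycles works out to about 7.2 bytes per cycle, the plain PSHUFB loop at 794 cycles to about 5.2, and the original MOV/BSWAP loop at 2064 cycles to about 2.0.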

The MASM Forum

General => The Laboratory => Topic started by: frktons on December 04, 2012, 10:15:28 AM