The MASM Forum

General => The Laboratory => Topic started by: frktons on December 04, 2012, 10:15:28 AM

Title: Array Reverse with SSE
Post by: frktons on December 04, 2012, 10:15:28 AM
Next step: array reverse.

The full program is attached, with a check of the results.

Any idea how to improve the routine?

Frank
Title: Re: Array Reverse with SSE
Post by: frktons on December 04, 2012, 10:51:56 PM
My test of the routine, reversing a 4096-byte array:
Code:
-------------------------------------------------
Intel(R) Core(TM)2 CPU  E6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
-------------------------------------------------
1.193   cycles for Reverse Array with PSHUFB
1.193   cycles for Reverse Array with PSHUFB

Any smarter code, or improvements?

Note: PSHUFB needs a CPU with SSSE3 support,
i.e. an Intel Core 2 Duo or newer.
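
For anyone who wants to see what the PSHUFB step does without an SSSE3 machine, here is a minimal scalar sketch in C. It only models the documented behaviour of the instruction on one 16-byte block and is not the attached routine; the names `pshufb_model`, `REVERSE_MASK` and `reverse_blocks` are made up for illustration, and the array length is assumed to be a multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mask: result byte i is taken from source byte 15 - i. */
static const uint8_t REVERSE_MASK[16] = {
    15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
};

/* Scalar model of PSHUFB on one 16-byte block: a mask byte with bit 7
   set yields 0, otherwise its low 4 bits select the source byte.
   dst and src must not overlap. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                  const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++) {
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    }
}

/* Reverse n bytes (n assumed to be a multiple of 16) from src into dst:
   each block is byte-reversed and written to the mirrored position. */
void reverse_blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t off = 0; off < n; off += 16) {
        pshufb_model(dst + (n - 16 - off), src + off, REVERSE_MASK);
    }
}
```

The real instruction does the 16 byte selections at once, which is why one PSHUFB can replace four BSWAPs.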

Frank
Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 09:43:21 AM
Comparing xmm/pshufb with mov/bswap for reversing data:

Quote
------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------
1.194   cycles for Reverse Array with PSHUFB

3.094   cycles for Reverse Array with MOV/BSWAP

1.193   cycles for Reverse Array with PSHUFB

3.096   cycles for Reverse Array with MOV/BSWAP

PSHUFB looks much faster  :icon_eek:
Title: Re: Array Reverse with SSE
Post by: qWord on December 05, 2012, 10:00:43 AM
Could you also upload an EXE?
BTW: a version using 4 GPRs for BSWAP (16 bytes) would be interesting.
Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 10:09:26 AM
Quote
Could you also upload an EXE?
BTW: a version using 4 GPRs for BSWAP (16 bytes) would be interesting.

Yes to the EXE: here it is.
I doubt it could be faster, considering that:

ecx is used for the loop counter
ebx and eax are used to address the top and bottom of the
array [it's an in-place reverse]
4 more GPRs are quite a lot to free. If you know how, show me.
The attached file is the EXE; rename .zip to .exe to use it.
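
The in-place scheme described above (one counter, one pointer at each end) can be sketched in portable C roughly as follows. This is an assumption about the shape of the attached routine, not a copy of it: `swap_bytes32` stands in for BSWAP, `reverse_in_place` is a made-up name, and the length is assumed to be a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the BSWAP instruction. */
uint32_t swap_bytes32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* In-place reverse of n bytes (n assumed to be a multiple of 8).
   Each pass byte-swaps one dword from each end and exchanges them. */
void reverse_in_place(uint8_t *buf, size_t n)
{
    if (n < 8) return;              /* nothing to exchange */
    uint8_t *lo = buf;              /* bottom pointer */
    uint8_t *hi = buf + n - 4;      /* top pointer */
    for (size_t count = n / 8; count > 0; count--) {
        uint32_t a, b;
        memcpy(&a, lo, 4);          /* load bottom dword */
        memcpy(&b, hi, 4);          /* load top dword */
        a = swap_bytes32(a);
        b = swap_bytes32(b);
        memcpy(lo, &b, 4);          /* swapped top goes to the bottom */
        memcpy(hi, &a, 4);          /* swapped bottom goes to the top */
        lo += 4;
        hi -= 4;
    }
}
```

The sketch needs three live values for bookkeeping plus two for data, which matches the register pressure mentioned above.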
Title: Re: Array Reverse with SSE
Post by: qWord on December 05, 2012, 10:40:41 AM
Quote
4 more GPRs are quite a lot to free. If you know how, show me.

Something like this:
Code:
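    lea  esi, [MyArray + sizeof MyArray - 16]  ; esi -> last 16-byte block of the source
    lea  edi, MyDestArray                      ; edi -> start of the destination
    align 8
@@:
    mov  eax, [esi+0*4]                        ; load 4 dwords (16 bytes)
    mov  ebx, [esi+1*4]
    mov  ecx, [esi+2*4]
    mov  edx, [esi+3*4]
    bswap edx                                  ; reverse the bytes inside each dword
    bswap ecx
    bswap ebx
    bswap eax
    mov  [edi+0*4], edx                        ; store the dwords in reverse order
    mov  [edi+1*4], ecx
    mov  [edi+2*4], ebx
    mov  [edi+3*4], eax
    lea  esi, [esi-16]                         ; step both pointers; lea leaves the flags untouched
    lea  edi, [edi+16]
    cmp  esi, OFFSET MyArray
    jae  @B                                    ; loop until esi falls below the array start
@@: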
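
For checking the logic outside the assembler, the loop above can be modelled in portable C along these lines. It is only a sketch: `swap_bytes32` stands in for BSWAP, `reverse_copy_4gpr` is a made-up name, the four array slots play the role of eax..edx, and the length is assumed to be a multiple of 16 with separate source and destination arrays.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the BSWAP instruction. */
uint32_t swap_bytes32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Copy n bytes (n assumed to be a multiple of 16) from src to dst in
   reverse order, 16 bytes per pass. dst and src must not overlap. */
void reverse_copy_4gpr(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t off = 0; off < n; off += 16) {
        const uint8_t *s = src + (n - 16 - off);  /* walks down, like esi */
        uint8_t *d = dst + off;                   /* walks up, like edi */
        uint32_t r[4];                            /* eax, ebx, ecx, edx */
        uint32_t out[4];

        memcpy(r, s, 16);                         /* four dword loads */
        out[0] = swap_bytes32(r[3]);              /* last dword goes first */
        out[1] = swap_bytes32(r[2]);
        out[2] = swap_bytes32(r[1]);
        out[3] = swap_bytes32(r[0]);
        memcpy(d, out, 16);                       /* four dword stores */
    }
}
```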
Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 10:54:16 AM
qWord, with 4 GPRs I had to change the logic (2 arrays instead of 1),
but it is quite fast:
Quote
---------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
---------------------------------------------------------
1.198   cycles for Reverse Array with PSHUFB
3.098   cycles for Reverse Array with MOV/BSWAP
1.858   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
794     cycles for Reverse Array with PSHUFB
2.064   cycles for Reverse Array with MOV/BSWAP
1.680   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
Attached is the new package with source and EXE.
Title: Re: Array Reverse with SSE
Post by: qWord on December 05, 2012, 11:04:20 AM
Quote
qWord, with 4 GPRs I had to change the logic: 2 arrays instead of 1

Like the SSSE3 version.

Quote
---------------------------------------------------------
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
784     cycles for Reverse Array with PSHUFB
1.696   cycles for Reverse Array with MOV/BSWAP
1.097   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
789     cycles for Reverse Array with PSHUFB
1.696   cycles for Reverse Array with MOV/BSWAP
1.106   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 11:15:28 AM
Unrolling the loops gives a boost, and processing more data
per instruction gives another boost.
The SSSE3 version takes advantage of both.
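
To make the two ideas concrete, here is a rough C sketch of a loop body that handles four 16-byte blocks per pass. It is a scalar model only, not the SSSE3 routine: `reverse16` stands in for a load plus PSHUFB with a reversing mask, `reverse_copy_unrolled4` is a made-up name, and the length is assumed to be a multiple of 64.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-reverse one 16-byte block; stands in for PSHUFB with a
   15..0 mask. dst and src must not overlap. */
void reverse16(uint8_t dst[16], const uint8_t src[16])
{
    for (int i = 0; i < 16; i++) {
        dst[i] = src[15 - i];
    }
}

/* Copy n bytes (n assumed to be a multiple of 64) from src to dst in
   reverse order. The body is unrolled by hand: four independent
   blocks per pass, one per hypothetical xmm register. */
void reverse_copy_unrolled4(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t off = 0; off < n; off += 64) {
        const uint8_t *s = src + off;
        uint8_t *d = dst + (n - 64 - off);
        reverse16(d + 48, s);        /* first block lands last */
        reverse16(d + 32, s + 16);
        reverse16(d + 16, s + 32);
        reverse16(d,      s + 48);   /* last block lands first */
    }
}
```

The four blocks do not depend on each other, so the CPU can overlap their loads, shuffles and stores, and the loop overhead is paid once per 64 bytes instead of once per 16.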
Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 07:36:33 PM
And here it is, probably the fastest combination using SSSE3
with 4 xmm registers:
Quote
---------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
---------------------------------------------------------
   794  cycles for Reverse Array with PSHUFB
 2.071  cycles for Reverse Array with MOV/BSWAP
 1.642  cycles for Reverse Array with MOV/BSWAP with 4 GPRs
   594  cycles for Reverse Array with PSHUFB using 4 xmm
---------------------------------------------------------
   794  cycles for Reverse Array with PSHUFB
 2.068  cycles for Reverse Array with MOV/BSWAP
 1.643  cycles for Reverse Array with MOV/BSWAP with 4 GPRs
   594  cycles for Reverse Array with PSHUFB using 4 xmm
---------------------------------------------------------
Title: Re: Array Reverse with SSE
Post by: sinsi on December 05, 2012, 07:59:52 PM
Just FYI, this test and the last one don't work on my AMD (Phenom II X6 1100T).
C000001D - illegal instruction
Title: Re: Array Reverse with SSE
Post by: frktons on December 05, 2012, 08:18:46 PM
Quote
Just FYI, this test and the last one don't work on my AMD (Phenom II X6 1100T).
C000001D - illegal instruction

Is your CPU able to use SSSE3 instructions like PSHUFB?

I don't know how much AMD and Intel processors differ,
but I work on Intel, and the tests run there.

If you point out the instruction that raises the error, we can work out what
to do about it.
Title: Re: Array Reverse with SSE
Post by: hutch-- on December 05, 2012, 10:47:42 PM
Frank,

Put these types of topics in the Lab; that way they won't get lost among other postings on simpler questions.
Title: Re: Array Reverse with SSE
Post by: frktons on December 06, 2012, 01:39:26 AM
Quote
Frank,

Put these types of topics in the Lab; that way they won't get lost among other postings on simpler questions.

OK Steve. They started as simple questions, but became more intriguing
as I got my hands dirty.  I should have foreseen that.  :t
Title: Re: Array Reverse with SSE
Post by: frktons on December 06, 2012, 11:19:17 AM
I tried to unroll the SSSE3/PSHUFB routine with 4 xmm registers one step further,
but I got no significant gain. Maybe using all 8 xmm registers at the same time could give
a small boost, but I'm not sure. Should I try or shouldn't I?

We'll see.   :icon_cool:
Title: Re: Array Reverse with SSE
Post by: frktons on December 07, 2012, 12:58:55 AM
I think I have reached the limit of the SSSE3 code:
Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
 1.191  cycles for Reverse Array with PSHUFB
 2.064  cycles for Reverse Array with MOV/BSWAP
 1.617  cycles for Reverse Array with MOV/BSWAP with 4 GPRs
   565  cycles for Reverse Array with PSHUFB using 4 xmm unrolled 4
   594  cycles for Reverse Array with PSHUFB using 4 xmm
------------------------------------------------------------------------
   794  cycles for Reverse Array with PSHUFB
 2.064  cycles for Reverse Array with MOV/BSWAP
 1.639  cycles for Reverse Array with MOV/BSWAP with 4 GPRs
   565  cycles for Reverse Array with PSHUFB using 4 xmm unrolled 4
   596  cycles for Reverse Array with PSHUFB using 4 xmm
------------------------------------------------------------------------

565 CPU cycles seems to be the lowest value reachable on my system.
While the other routines show different timings on every run, the version using 4 xmm
registers unrolled 4 times tends to always give the same value.