Topic: Array Reverse with SSE

frktons

  • Member
  • ***
  • Posts: 491
Array Reverse with SSE
« on: December 04, 2012, 10:15:28 AM »
Next step: array reverse.

Attached is the full program, with a check of the results.

Any idea how to improve the routine?

Frank
« Last Edit: December 05, 2012, 01:27:35 AM by frktons »
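The program itself is only in the attachment. As a hypothetical baseline for the result check, here is a minimal C sketch of a plain byte-by-byte in-place reverse and a checker. It is not Frank's code, and the names `reverse_ref` and `is_reversed` are made up. Faster routines can be compared against it:

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: swap bytes from both ends, one pair at a time.
   Slow, but obviously correct, so faster routines can be checked against it. */
void reverse_ref(unsigned char *a, size_t n)
{
    assert(a != NULL || n == 0);
    size_t i = 0, j = n;          /* i walks up from the bottom, j down from the top */
    while (j > i + 1) {
        --j;
        unsigned char t = a[i];
        a[i] = a[j];
        a[j] = t;
        ++i;
    }
}

/* Returns 1 if dst holds the n bytes of src in reverse order, else 0. */
int is_reversed(const unsigned char *src, const unsigned char *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (dst[i] != src[n - 1 - i])
            return 0;
    return 1;
}
```

Typical use: keep an untouched copy of the input, run the routine under test on another copy, then call `is_reversed` on the pair.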

frktons

Re: Array Reverse with SSE
« Reply #1 on: December 04, 2012, 10:51:56 PM »
My test of the routine reversing a 4096-byte array:
Code:
-------------------------------------------------
Intel(R) Core(TM)2 CPU  E6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
-------------------------------------------------
1,193   cycles for Reverse Array with PSHUFB
1,193   cycles for Reverse Array with PSHUFB

Any smarter code, or improvements?

Note: PSHUFB needs a CPU with SSSE3 support,
i.e. an Intel Core 2 Duo or newer.

Frank
« Last Edit: December 05, 2012, 01:16:40 AM by frktons »
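The routine itself is only in the attachment. Below is a minimal sketch of the PSHUFB idea, written in C with SSSE3 intrinsics rather than MASM (`_mm_shuffle_epi8` compiles to PSHUFB). It assumes an out-of-place copy and a length that is a multiple of 16, and `reverse_pshufb` is a hypothetical name. The `target("ssse3")` attribute is GCC/Clang-specific and only allows the compiler to emit the instruction. The CPU still has to support SSSE3 at run time:

```c
#include <assert.h>
#include <stddef.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Reverse n bytes from src into dst; n must be a multiple of 16. */
__attribute__((target("ssse3")))
void reverse_pshufb(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n % 16 == 0);
    /* PSHUFB control: result byte i takes source byte mask[i],
       so 15, 14, ..., 0 reverses the 16 bytes of a register. */
    const __m128i mask = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    for (size_t i = 0; i < n; i += 16) {
        /* The last 16-byte chunk of src becomes the first chunk of dst. */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + n - 16 - i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_shuffle_epi8(v, mask));
    }
}
```

One shuffle reverses a whole 16-byte block. Reading the source from its end while writing the destination from its start reverses the array as a whole.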

frktons

Re: Array Reverse with SSE
« Reply #2 on: December 05, 2012, 09:43:21 AM »
Comparing XMM/PSHUFB with MOV/BSWAP for reversing data:

Quote
------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------
1,194   cycles for Reverse Array with PSHUFB
3,094   cycles for Reverse Array with MOV/BSWAP
1,193   cycles for Reverse Array with PSHUFB
3,096   cycles for Reverse Array with MOV/BSWAP

PSHUFB looks much faster, about 2.6 times fewer cycles than MOV/BSWAP  :icon_eek:

qWord

  • Member
  • *****
  • Posts: 1473
  • The base type of a type is the type itself
    • SmplMath macros
Re: Array Reverse with SSE
« Reply #3 on: December 05, 2012, 10:00:43 AM »
Could you also upload an EXE?
BTW: a version using 4 GPRs for BSWAP (16 bytes) would be interesting.
MREAL macros - when you need floating point arithmetic while assembling!

frktons

Re: Array Reverse with SSE
« Reply #4 on: December 05, 2012, 10:09:26 AM »
Quote from: qWord
Could you also upload an EXE?
BTW: a version using 4 GPRs for BSWAP (16 bytes) would be interesting.

Yes, here is the EXE.
I doubt it could be faster, considering that:

ECX is used as the loop counter,
EBX and EAX address the top and bottom of the array (it's an in-place reverse).
4 more GPRs are quite a lot to free. If you know how, show me.
The attached file is the EXE; rename it from .zip to .exe to use it.
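This is a hypothetical C reconstruction of the in-place MOV/BSWAP scheme described above; the actual source is only in the attachment. Two offsets walk in from the ends, one counter drives the loop, and each step swaps two byte-reversed dwords. `__builtin_bswap32` is a GCC/Clang builtin (MSVC has `_byteswap_ulong`). The name `reverse_bswap_inplace` is made up, and the sketch assumes a length that is a multiple of 8:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* In-place reverse, one dword from each end per step.
   lo/hi play the role of the two address registers, count the loop counter.
   n must be a multiple of 8 so the two ends never overlap inside a dword. */
void reverse_bswap_inplace(unsigned char *a, size_t n)
{
    assert(n % 8 == 0);
    size_t lo = 0, hi = n;               /* byte offsets of bottom and top */
    for (size_t count = n / 8; count != 0; --count) {
        hi -= 4;                         /* next dword down from the top */
        uint32_t x, y;
        memcpy(&x, a + lo, 4);           /* memcpy: an unaligned-safe MOV */
        memcpy(&y, a + hi, 4);
        x = __builtin_bswap32(x);        /* reverse bytes inside each dword */
        y = __builtin_bswap32(y);
        memcpy(a + lo, &y, 4);           /* and swap the two dwords */
        memcpy(a + hi, &x, 4);
        lo += 4;
    }
}
```

Each iteration handles 8 bytes through two 32-bit registers, compared with 16 bytes per PSHUFB.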

qWord

Re: Array Reverse with SSE
« Reply #5 on: December 05, 2012, 10:40:41 AM »
Quote from: frktons
4 more GPRs are quite a lot to free. If you know how, show me.

Something like this:
Code:
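    ; assumes sizeof MyArray is a multiple of 16 and MyDestArray is at least as large;
    ; inside a PROC, preserve EBX/ESI/EDI (e.g. USES ebx esi edi)
    lea  esi, [MyArray + sizeof MyArray - 16]   ; esi -> last 16-byte block of the source
    lea  edi, MyDestArray                       ; edi -> start of the destination
    align 8
@@:
    mov eax, [esi+0*4]        ; load 16 bytes as four independent dwords
    mov ebx, [esi+1*4]
    mov ecx, [esi+2*4]
    mov edx, [esi+3*4]
    bswap edx                 ; reverse the bytes inside each dword
    bswap ecx
    bswap ebx
    bswap eax
    mov [edi+0*4],edx         ; store the dwords in reverse order
    mov [edi+1*4],ecx
    mov [edi+2*4],ebx
    mov [edi+3*4],eax
    lea esi,[esi-16]          ; previous source block
    lea edi,[edi+16]          ; next destination block
    cmp esi,OFFSET MyArray    ; loop while esi is still inside the source
    jae @B
@@: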

frktons

Re: Array Reverse with SSE
« Reply #6 on: December 05, 2012, 10:54:16 AM »
qWord, with 4 GPRs I had to change the logic: two arrays (source and destination)
instead of one, but it is quite fast:
Quote
---------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
---------------------------------------------------------
1,198   cycles for Reverse Array with PSHUFB
3,098   cycles for Reverse Array with MOV/BSWAP
1,858   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
794     cycles for Reverse Array with PSHUFB
2,064   cycles for Reverse Array with MOV/BSWAP
1,680   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
Attached is the new package with your version, source and EXE.

qWord

Re: Array Reverse with SSE
« Reply #7 on: December 05, 2012, 11:04:20 AM »
Quote from: frktons
qWord, with 4 GPRs I had to change the logic: two arrays (source and destination) instead of one

Like the SSSE3 version.

Quote
---------------------------------------------------------
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
784     cycles for Reverse Array with PSHUFB
1,696   cycles for Reverse Array with MOV/BSWAP
1,097   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
789     cycles for Reverse Array with PSHUFB
1,696   cycles for Reverse Array with MOV/BSWAP
1,106   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------

frktons

Re: Array Reverse with SSE
« Reply #8 on: December 05, 2012, 11:15:28 AM »
Unrolling the loop gives a boost, and processing more data per iteration
gives another one. The SSSE3 version takes advantage of both.
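To illustrate both points, here is a rough C transliteration of qWord's four-GPR loop. It is not tuned, and a compiler may vectorize or reorder it, so its timings won't necessarily match the hand-written MASM. `reverse_bswap4` is a hypothetical name. The sketch copies out of place, assumes the length is a multiple of 16, and relies on the GCC/Clang builtin `__builtin_bswap32`. The four dwords do not depend on each other, so their loads and byte swaps can overlap in the pipeline:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Out-of-place reverse, 16 bytes per iteration as four independent dwords.
   n must be a multiple of 16. */
void reverse_bswap4(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16) {
        const unsigned char *s = src + n - 16 - i;   /* 16-byte block from the end */
        uint32_t w0, w1, w2, w3;
        memcpy(&w0, s + 0, 4);          /* memcpy: an unaligned-safe 32-bit load */
        memcpy(&w1, s + 4, 4);
        memcpy(&w2, s + 8, 4);
        memcpy(&w3, s + 12, 4);
        w0 = __builtin_bswap32(w0);     /* reverse bytes inside each dword */
        w1 = __builtin_bswap32(w1);
        w2 = __builtin_bswap32(w2);
        w3 = __builtin_bswap32(w3);
        memcpy(dst + i + 0, &w3, 4);    /* store the dwords in reverse order */
        memcpy(dst + i + 4, &w2, 4);
        memcpy(dst + i + 8, &w1, 4);
        memcpy(dst + i + 12, &w0, 4);
    }
}
```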

frktons

Re: Array Reverse with SSE
« Reply #9 on: December 05, 2012, 07:36:33 PM »
And here it is, probably the fastest combination: SSSE3
with 4 XMM registers:
Quote
---------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
---------------------------------------------------------
   794  cycles for Reverse Array with PSHUFB
 2,071  cycles for Reverse Array with MOV/BSWAP
 1,642  cycles for Reverse Array with MOV/BSWAP with 4 GPRs
   594  cycles for Reverse Array with PSHUFB using 4 xmm
---------------------------------------------------------
   794  cycles for Reverse Array with PSHUFB
 2,068  cycles for Reverse Array with MOV/BSWAP
 1,643  cycles for Reverse Array with MOV/BSWAP with 4 GPRs
   594  cycles for Reverse Array with PSHUFB using 4 xmm
---------------------------------------------------------
« Last Edit: December 05, 2012, 08:47:21 PM by frktons »
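This is a hedged C sketch of the same idea, with 64 bytes per iteration spread over four XMM registers. It assumes SSSE3, GCC/Clang, an out-of-place copy, and a length that is a multiple of 64 (4096 qualifies). `reverse_pshufb4` is a made-up name, not Frank's routine:

```c
#include <assert.h>
#include <stddef.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Out-of-place reverse, 64 bytes per iteration in four XMM registers.
   n must be a multiple of 64; the CPU must support SSSE3 at run time. */
__attribute__((target("ssse3")))
void reverse_pshufb4(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n % 64 == 0);
    const __m128i mask = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    for (size_t i = 0; i < n; i += 64) {
        const unsigned char *s = src + n - 64 - i;   /* 64-byte block from the end */
        /* Four independent load + shuffle chains */
        __m128i v0 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(s + 0)), mask);
        __m128i v1 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(s + 16)), mask);
        __m128i v2 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(s + 32)), mask);
        __m128i v3 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(s + 48)), mask);
        /* The highest source chunk becomes the lowest destination chunk */
        _mm_storeu_si128((__m128i *)(dst + i + 0), v3);
        _mm_storeu_si128((__m128i *)(dst + i + 16), v2);
        _mm_storeu_si128((__m128i *)(dst + i + 32), v1);
        _mm_storeu_si128((__m128i *)(dst + i + 48), v0);
    }
}
```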

sinsi

  • Guest
Re: Array Reverse with SSE
« Reply #10 on: December 05, 2012, 07:59:52 PM »
Just FYI, this test and the last one don't work on my AMD (Phenom II X6 1100T):
C000001D - illegal instruction

frktons

Re: Array Reverse with SSE
« Reply #11 on: December 05, 2012, 08:18:46 PM »
Quote from: sinsi
Just FYI, this test and the last one don't work on my AMD (Phenom II X6 1100T):
C000001D - illegal instruction

Does your CPU support SSSE3 instructions such as PSHUFB?

I don't know how much AMD and Intel processors differ, but I work on Intel,
and the tests run fine there.

If you can point out the instruction that raises the error, we can work out
what to do about it.
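For reference, C000001D is the Windows STATUS_ILLEGAL_INSTRUCTION code. The Phenom II family has no SSSE3 (AMD added it in later families), so PSHUFB faults there. A common remedy is to test CPUID leaf 1, ECX bit 9, at run time and fall back to plain code when the bit is clear. Here is a minimal sketch, assuming GCC/Clang and their `<cpuid.h>`. `reverse_bytes` and its helpers are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <cpuid.h>       /* GCC/Clang: __get_cpuid and bit_SSSE3 */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* CPUID leaf 1 reports SSSE3 in ECX bit 9 (bit_SSSE3). */
static int cpu_has_ssse3(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                      /* leaf 1 not available */
    return (ecx & bit_SSSE3) != 0;
}

/* Fallback that runs on any x86 CPU. */
static void reverse_scalar(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[n - 1 - i];
}

/* PSHUFB path: 16 bytes per step, leftover bytes handled one at a time. */
__attribute__((target("ssse3")))
static void reverse_ssse3(unsigned char *dst, const unsigned char *src, size_t n)
{
    const __m128i mask = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + n - 16 - i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_shuffle_epi8(v, mask));
    }
    for (; i < n; ++i)
        dst[i] = src[n - 1 - i];
}

/* Out-of-place reverse that executes PSHUFB only when CPUID reports SSSE3. */
void reverse_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    if (cpu_has_ssse3())
        reverse_ssse3(dst, src, n);
    else
        reverse_scalar(dst, src, n);
}
```

In MASM the same check is `mov eax, 1` / `cpuid` / `test ecx, 1 SHL 9` before selecting the PSHUFB routine (CPUID also overwrites EBX and EDX).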

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 7537
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Array Reverse with SSE
« Reply #12 on: December 05, 2012, 10:47:42 PM »
Frank,

Put this kind of topic in the Lab; that way it won't get lost among postings on simpler questions.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

frktons

Re: Array Reverse with SSE
« Reply #13 on: December 06, 2012, 01:39:26 AM »
Quote from: hutch--
Frank,

Put this kind of topic in the Lab; that way it won't get lost among postings on simpler questions.

OK Steve. They started as simple questions, but became more intriguing
as I got my hands dirty. I should have foreseen that.  :t

frktons

Re: Array Reverse with SSE
« Reply #14 on: December 06, 2012, 11:19:17 AM »
I tried to unroll the SSSE3/PSHUFB version with 4 XMM registers one step further,
but got no significant gain. Maybe using all 8 XMM registers at once could give
a small boost, but I'm not sure. Should I try or not?

We'll see.   :icon_cool:
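If you do try it, note that 32-bit code has only XMM0-XMM7. Using eight data registers leaves none free for the shuffle mask, so the mask would have to be a memory operand, which PSHUFB accepts. x86-64 code has sixteen XMM registers. Below is a minimal sketch of an 8-register variant plus a min-of-N RDTSC timer, in C with GCC/Clang intrinsics. All names are hypothetical. In C the compiler, not the source, decides which registers are actually used. Also, on newer CPUs TSC ticks are not necessarily core clock cycles:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* GCC/Clang: SSSE3 intrinsics and __rdtsc */

/* 128 bytes per iteration in eight XMM-sized chunks; n must be a multiple of 128. */
__attribute__((target("ssse3")))
void reverse_pshufb8(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n % 128 == 0);
    const __m128i mask = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    for (size_t i = 0; i < n; i += 128) {
        const unsigned char *s = src + n - 128 - i;   /* 128-byte block from the end */
        __m128i v[8];
        for (int k = 0; k < 8; ++k)    /* eight independent loads + shuffles */
            v[k] = _mm_shuffle_epi8(
                _mm_loadu_si128((const __m128i *)(s + 16 * k)), mask);
        for (int k = 0; k < 8; ++k)    /* highest source chunk is stored first */
            _mm_storeu_si128((__m128i *)(dst + i + 16 * k), v[7 - k]);
    }
}

typedef void (*reverse_fn)(unsigned char *, const unsigned char *, size_t);

/* Smallest RDTSC delta over several runs; the minimum filters out
   interrupts and the cold-cache first run. */
uint64_t min_ticks(reverse_fn f, unsigned char *dst,
                   const unsigned char *src, size_t n, int runs)
{
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < runs; ++r) {
        uint64_t t0 = __rdtsc();
        f(dst, src, n);
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling `min_ticks` once with a 4-register version and once with this one, on the same 4096-byte buffers, gives a like-for-like comparison. Whether the extra registers help is likely to vary from CPU to CPU.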