Array Reverse with SSE

Started by frktons, December 04, 2012, 10:15:28 AM

frktons

Next step: array reverse.

Attached is the full program, including a check of the results.

Any ideas on how to improve the routine?

Frank
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

frktons

My test of the routine, reversing a 4096-byte array:

-------------------------------------------------
Intel(R) Core(TM)2 CPU  E6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
-------------------------------------------------
1.193   cycles for Reverse Array with PSHUFB
1.193   cycles for Reverse Array with PSHUFB


Any smarter code, or improvement?

Note: PSHUFB needs a CPU with SSSE3 support,
i.e. an Intel Core 2 Duo or newer.
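For readers without the attachment, here is a minimal sketch of the PSHUFB idea in C rather than MASM. It assumes GCC or Clang on x86, where `_mm_shuffle_epi8` compiles to PSHUFB. The function name `reverse_pshufb` is hypothetical, and this is not necessarily the attached routine: it copies into a second buffer, while the attachment may work differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3 intrinsics; _mm_shuffle_epi8 maps to PSHUFB */

/* Hypothetical helper: write the n bytes of src to dst in reverse order.
   Assumptions: n is a multiple of 16, src and dst do not overlap. */
__attribute__((target("ssse3")))
void reverse_pshufb(uint8_t *dst, const uint8_t *src, size_t n)
{
    /* Shuffle mask: output byte i takes input byte 15 - i. */
    const __m128i mask = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    for (size_t i = 0; i < n; i += 16) {
        /* Read 16 bytes from the top of src, walking downwards. */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + n - 16 - i));
        v = _mm_shuffle_epi8(v, mask);          /* reverse the 16 bytes */
        /* Write them to the bottom of dst, walking upwards. */
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
}
```

Unaligned loads and stores are used so the sketch works on any buffer; with 16-byte aligned arrays the aligned forms (MOVDQA) could be used instead.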

Frank

frktons

Comparing xmm/pshufb with mov/bswap for reversing data:

Quote
------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------
1.194   cycles for Reverse Array with PSHUFB

3.094   cycles for Reverse Array with MOV/BSWAP

1.193   cycles for Reverse Array with PSHUFB

3.096   cycles for Reverse Array with MOV/BSWAP

The PSHUFB version looks much faster.

qWord

Could you also upload an EXE?
BTW: a version using 4 GPRs for BSWAP (16 bytes per iteration) would be interesting.
MREAL macros - when you need floating point arithmetic while assembling!

frktons

Quote from: qWord on December 05, 2012, 10:00:43 AM
Could you also upload an EXE?
BTW: a version using 4 GPRs for BSWAP (16 bytes per iteration) would be interesting.

Yes, here is the EXE.
I doubt it could be faster, considering that:

ecx is used as the loop counter
ebx and eax are used for addressing the top and bottom of the
array [it's an in-place reverse]
4 more GPRs are quite a lot to free. If you know how, show me.
The attached file is the EXE; rename the .zip extension to .exe to use it.
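The in-place ("on site") scheme described above, with one pointer at the bottom and one at the top of the array, can be sketched in C as follows. This is an illustration under stated assumptions, not the attached routine: `bswap32` is a portable stand-in for the BSWAP instruction, `reverse_bswap_inplace` is a hypothetical name, and the length is assumed to be a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for BSWAP: reverse the 4 bytes of x. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Hypothetical helper: reverse buf in place.
   Assumption: n is a multiple of 8 (two dwords are swapped per step). */
void reverse_bswap_inplace(uint8_t *buf, size_t n)
{
    if (n < 8)
        return;
    uint8_t *lo = buf;           /* bottom of the array, moves up   */
    uint8_t *hi = buf + n - 4;   /* top of the array, moves down    */
    while (lo < hi) {
        uint32_t a, b;
        memcpy(&a, lo, 4);       /* load one dword from each end    */
        memcpy(&b, hi, 4);
        a = bswap32(a);          /* reverse the bytes in each dword */
        b = bswap32(b);
        memcpy(lo, &b, 4);       /* store them at the opposite end  */
        memcpy(hi, &a, 4);
        lo += 4;
        hi -= 4;
    }
}
```

Two address registers plus a counter is all this variant needs, which is why freeing four more GPRs for data pushes towards a separate destination array.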

qWord

Quote from: frktons on December 05, 2012, 10:09:26 AM
4 more GPR are quite a lot to free. If you know how, show me.
Something like this:
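    lea  esi, [MyArray + sizeof MyArray - 16]   ; esi -> last 16 bytes of the source
    lea  edi, MyDestArray                       ; edi -> start of the destination
        align 8
@@:
    mov eax, [esi+0*4]          ; load 4 dwords (16 bytes)
    mov ebx, [esi+1*4]
    mov ecx, [esi+2*4]
    mov edx, [esi+3*4]
    bswap edx                   ; reverse the bytes inside each dword
    bswap ecx
    bswap ebx
    bswap eax
    mov [edi+0*4],edx           ; store the dwords in reverse order
    mov [edi+1*4],ecx
    mov [edi+2*4],ebx
    mov [edi+3*4],eax
    lea esi,[esi-16]            ; source pointer moves down
    lea edi,[edi+16]            ; destination pointer moves up
    cmp esi,OFFSET MyArray
    jae @B                      ; loop until esi drops below the array start
@@: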

frktons

qWord, with 4 GPRs I had to change the logic: 2 arrays instead of 1,
but it is quite fast:
Quote
---------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
---------------------------------------------------------
1.198   cycles for Reverse Array with PSHUFB
3.098   cycles for Reverse Array with MOV/BSWAP
1.858   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
794     cycles for Reverse Array with PSHUFB
2.064   cycles for Reverse Array with MOV/BSWAP
1.680   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
Attached is the new package with source and EXE.

qWord

Quote from: frktons on December 05, 2012, 10:54:16 AM
qWord, with 4 GPRs I had to change the logic: 2 Arrays instead of 1
like the SSSE3 version.

Quote
---------------------------------------------------------
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
784     cycles for Reverse Array with PSHUFB
1.696   cycles for Reverse Array with MOV/BSWAP
1.097   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------
789     cycles for Reverse Array with PSHUFB
1.696   cycles for Reverse Array with MOV/BSWAP
1.106   cycles for Reverse Array with MOV/BSWAP with 4 GPRs
---------------------------------------------------------

frktons

Unrolling the loops gives a boost, and processing more data
per iteration gives another boost.
The SSSE3 version takes advantage of both.

frktons

And here it is, probably the fastest combination using SSSE3
with 4 xmm registers:
Quote
---------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
---------------------------------------------------------
   794  cycles for Reverse Array with PSHUFB
2.071  cycles for Reverse Array with MOV/BSWAP
1.642  cycles for Reverse Array with MOV/BSWAP with 4 GPRs
   594  cycles for Reverse Array with PSHUFB using 4 xmm
---------------------------------------------------------
   794  cycles for Reverse Array with PSHUFB
2.068  cycles for Reverse Array with MOV/BSWAP
1.643  cycles for Reverse Array with MOV/BSWAP with 4 GPRs
   594  cycles for Reverse Array with PSHUFB using 4 xmm
---------------------------------------------------------
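A rough C sketch of what "PSHUFB using 4 xmm" could look like: four independent 16-byte loads, shuffles and stores per iteration, so 64 bytes are handled per loop pass. This is an assumption about the structure, not the attached code. `reverse_pshufb_x4` is a hypothetical name, the length is assumed to be a multiple of 64, and source and destination are assumed not to overlap (GCC or Clang on x86).

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3 intrinsics; _mm_shuffle_epi8 maps to PSHUFB */

/* Hypothetical helper: reversed copy, 64 bytes per iteration.
   Assumptions: n is a multiple of 64, src and dst do not overlap. */
__attribute__((target("ssse3")))
void reverse_pshufb_x4(uint8_t *dst, const uint8_t *src, size_t n)
{
    /* Output byte i takes input byte 15 - i. */
    const __m128i mask = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    const uint8_t *s = src + n;              /* one past the end of src */
    for (size_t i = 0; i < n; i += 64) {
        s -= 64;                             /* next 64-byte block from the top */
        __m128i v0 = _mm_loadu_si128((const __m128i *)(s + 48));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i v2 = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i v3 = _mm_loadu_si128((const __m128i *)(s));
        v0 = _mm_shuffle_epi8(v0, mask);     /* four independent shuffles */
        v1 = _mm_shuffle_epi8(v1, mask);
        v2 = _mm_shuffle_epi8(v2, mask);
        v3 = _mm_shuffle_epi8(v3, mask);
        _mm_storeu_si128((__m128i *)(dst + i),      v0);
        _mm_storeu_si128((__m128i *)(dst + i + 16), v1);
        _mm_storeu_si128((__m128i *)(dst + i + 32), v2);
        _mm_storeu_si128((__m128i *)(dst + i + 48), v3);
    }
}
```

The four chains do not depend on each other, so the CPU can overlap them, and the loop overhead is paid once per 64 bytes instead of once per 16.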

sinsi

Just FYI, this test and the last one don't work on my AMD (Phenom II X6 1100T).
C000001D - illegal instruction

frktons

Quote from: sinsi on December 05, 2012, 07:59:52 PM
Just FYI, this test and the last one don't work on my AMD (Phenom II X6 1100T).
C000001D - illegal instruction

Does your CPU support SSSE3 instructions like PSHUFB?

I don't know how many differences there are between AMD and Intel
processors, but I work on Intel, and the tests run there.

If you point out the instruction that raises the error, we can find out what
to do about it.
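One way to answer the SSSE3 question at run time is to query CPUID before executing PSHUFB. A minimal sketch in C, assuming GCC or Clang on x86 (`<cpuid.h>` and `__get_cpuid` are compiler-specific, and `has_ssse3` is a hypothetical helper name). As far as I know, the Phenom II family implements SSE3 and SSE4a but not SSSE3, which would be consistent with the C000001D exception.

```c
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

/* Hypothetical helper: returns 1 if the CPU reports SSSE3, 0 otherwise. */
int has_ssse3(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;               /* CPUID leaf 1 is not available */
    return (ecx >> 9) & 1;      /* CPUID leaf 1, ECX bit 9 = SSSE3 */
}
```

A test program could call this once at startup and skip the PSHUFB timings, or fall back to the MOV/BSWAP routine, when it returns 0.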

hutch--

Frank,

Put these types of topics in the Lab; that way they won't get lost among other postings on simpler questions.

frktons

Quote from: hutch-- on December 05, 2012, 10:47:42 PM
Frank,

Put these types of topics in the Lab; that way they won't get lost among other postings on simpler questions.

OK Steve. They started as simple questions, but became more intriguing
as I got my hands dirty. I should have foreseen that.

frktons

I tried to unroll the SSSE3/PSHUFB version with 4 xmm registers one step further,
but I got no relevant gain. Maybe using all 8 xmm registers at the same time could give
a small boost, but I'm not sure. Should I try or shouldn't I?

We'll see.