Component test using bswap, xmm and mmx.

Started by hutch--, March 23, 2015, 02:38:34 PM


hutch--

This is not a working algorithm but a test of a component that reverses 12 bytes, taken from the end of a string, so that they end up the right way around at the start of a buffer. I had in mind a particular type of DWORD to ASCII conversion. The BSWAP version reverses 12 bytes; the mmx and xmm versions reverse 16 bytes.

Note that pshufb needs an SSSE3 machine to run the instruction.
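For anyone who wants to see what the shuffle does without stepping through it in a debugger, here is a rough scalar model in C. It is only a sketch of the instruction's documented behavior for the 128 bit form: each result byte is picked from the source by the low 4 bits of the matching mask byte, or zeroed if the mask byte has its top bit set. The names `pshufb_model` and `reverse_mask` are made up for illustration and do not appear in the listing.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the 128 bit pshufb: dst is both source and destination,
   as it is for the xmm register operand of the real instruction. */
void pshufb_model(uint8_t dst[16], const uint8_t mask[16])
{
    uint8_t src[16];
    memcpy(src, dst, 16);                 /* keep a copy of the original bytes */
    for (int i = 0; i < 16; i++) {
        if (mask[i] & 0x80)
            dst[i] = 0;                   /* top bit set: result byte is zero */
        else
            dst[i] = src[mask[i] & 0x0F]; /* otherwise select by low 4 bits */
    }
}

/* Index 15 first, index 0 last: reverses all 16 bytes. */
static const uint8_t reverse_mask[16] = {
    0x0F, 0x0E, 0x0D, 0x0C, 0x0B, 0x0A, 0x09, 0x08,
    0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00
};
```

Fed the same 16 bytes as `item` in the listing (four zero bytes followed by the reversed text), the model leaves the text the right way around at the start of the buffer with the four zeros after it, which is why the result can be printed as a zero terminated string.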


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    include \masm32\include\masm32rt.inc
    .686p
    .mmx
    .xmm

    reverse12bytes PROTO :DWORD
    reverse16bytes PROTO :DWORD

    mmxreverse PROTO :DWORD


    .data
    ; --------------------------
    ; item must be 16 bytes long
    ; --------------------------
      item db 0,0,0,0,"lkfihgfedcba"
      pitm dd item

    ; --------------------------
    ; item must be 12 bytes long
    ; --------------------------
      item2 db 0,0,0,0,"87654321"
      pitm2 dd item2

    ; --------------------------
    ; item must be 16 bytes long
    ; --------------------------
      item3 db 0,0,0,0,"210987654321"
      pitm3 dd item3


    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL tcnt  :DWORD

    push ebx
    push esi
    push edi

    print "--------------------------------",13,10
    print "performance test of 3 algorithms",13,10
    print "--------------------------------",13,10

    fn reverse16bytes,pitm
    print pitm,13,10

    fn reverse12bytes,pitm2
    print pitm2,13,10

    fn mmxreverse,pitm3
    print pitm3,13,10

    print "--------------------------------",13,10
    print "benchmark of 3 algorithms",13,10
    print "--------------------------------",13,10

  ; --------------------------

    fn GetTickCount
    push eax

    mov esi, 500000000
  ; align 16
  lbl2:
    fn reverse16bytes,pitm
    sub esi, 1
    jnz lbl2

    fn GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," ms pshufb",13,10

  ; --------------------------

    fn GetTickCount
    push eax

    mov esi, 500000000
  ; align 16
  lbl1:
    fn reverse12bytes,pitm2
    sub esi, 1
    jnz lbl1

    fn GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," ms bswap",13,10

  ; --------------------------

    fn GetTickCount
    push eax

    mov esi, 500000000
  ; align 16
  lbl3:
    fn mmxreverse,pitm3
    sub esi, 1
    jnz lbl3

    fn GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," ms mmxreverse",13,10

  ; --------------------------

  past:

    pop edi
    pop esi
    pop ebx

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

align 16

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

reverse12bytes proc pbuf:DWORD

    mov edx, [esp+4]

  ; ------------------------------------------
  ; byte swap and exchange the two 4 byte ends
  ; ------------------------------------------
    mov ecx, [edx+8]
    bswap ecx

    mov eax, [edx]
    bswap eax

    mov [edx+8], eax
    mov [edx], ecx

  ; ------------------------------------------
  ; byte swap the middle 4 bytes
  ; ------------------------------------------
    mov eax, [edx+4]
    bswap eax
    mov [edx+4], eax

    ret 4

reverse12bytes endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

align 16

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

reverse16bytes proc ptritm:DWORD

    .data
    ; ----------------------------------
    ; mask has 16 bytes in reverse order
    ; ----------------------------------
      bmask00000000 \
              db 0Fh,0Eh,0Dh,0Ch
              db 0Bh,0Ah,09h,08h
              db 07h,06h,05h,04h
              db 03h,02h,01h,00h
      pmsk00000000 dd bmask00000000
    .code

    mov eax, [esp+4]                            ; load source
    movdqu xmm1, [eax]
    mov ecx, pmsk00000000                       ; load mask
    movdqu xmm2, [ecx]

    pshufb xmm1,xmm2                            ; shuffle bytes according to mask order

    movdqu [eax], xmm1                          ; copy result back to memory

    ret 4

reverse16bytes endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

align 16

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

mmxreverse proc pvalu:DWORD

    .data
      mask001 \
        db 07h,06h,05h,04h
        db 03h,02h,01h,00h
      pmsk001 dd mask001
    .code

    mov eax, [esp+4]

    mov ecx, pmsk001
    movq mm1, [eax]
    movq mm2, [eax+8]
    movq mm3, [ecx]

    pshufb mm1,mm3
    pshufb mm2,mm3

    movq [eax+8], mm1
    movq [eax], mm2

    ret 4

mmxreverse endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
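The BSWAP routine in the listing can be sketched in portable C as follows. This is a model of the idea only, not a translation meant to compete on speed: byte swap each of the three 4 byte groups, then exchange the first and last group while the middle one stays in place. `bswap32` and `reverse12` are hypothetical names chosen for this example.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32 bit value, like the BSWAP instruction. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reverse 12 bytes in place using three 4 byte groups. */
void reverse12(unsigned char *buf)
{
    uint32_t first, middle, last;

    memcpy(&first,  buf,     4);
    memcpy(&middle, buf + 4, 4);
    memcpy(&last,   buf + 8, 4);

    first  = bswap32(first);
    middle = bswap32(middle);
    last   = bswap32(last);

    memcpy(buf,     &last,   4);   /* swapped last group goes to the front */
    memcpy(buf + 4, &middle, 4);   /* middle group stays where it was */
    memcpy(buf + 8, &first,  4);   /* swapped first group goes to the end */
}
```

With the 12 bytes of `item2` as input (four zeros, then "87654321"), the buffer ends up as "12345678" followed by the four zeros.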


These are typical results on my middle-aged i7.


--------------------------------
performance test of 3 algorithms
--------------------------------
abcdefghifkl
12345678
123456789012
--------------------------------
benchmark of 3 algorithms
--------------------------------
2059 ms pshufb
2231 ms bswap
2215 ms mmxreverse
Press any key to continue ...

rrr314159

Now I understand what pshufb does, thx, the Intel manual is confusing. Figured I'd see it used by somebody someday. It's not very impressive compared to old-fashioned moving the bytes around and bswapping, 2059 vs. 2231, of course that's 16 vs 12, but still, just 44% or so improvement. A lot of these new instructions are disappointing. Of course pshufb can shuffle other permutations with equal ease; but, requires 16 bytes for the mask. All in all, hardly seems like 10 years' worth of progress.
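The "44% or so" figure comes from comparing time per byte rather than time per call, since the two routines handle different amounts of data. A small sketch of that arithmetic in C, using the timings posted above; the function names are invented for this example, and the numbers still include the call and loop overhead of the benchmark, so treat the result as a rough guide.

```c
/* Time spent per byte reversed, given a total time and bytes handled per call.
   The call count is the same for both runs, so it cancels out of the ratio. */
double ms_per_byte(double total_ms, int bytes_per_call)
{
    return total_ms / bytes_per_call;
}

/* How many times faster per byte routine B is compared with routine A. */
double per_byte_speedup(double ms_a, int bytes_a, double ms_b, int bytes_b)
{
    return ms_per_byte(ms_a, bytes_a) / ms_per_byte(ms_b, bytes_b);
}
```

Calling `per_byte_speedup(2231.0, 12, 2059.0, 16)` gives roughly 1.44, meaning the pshufb version gets through about 44% more bytes in the same time on that machine.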

Do u use mmx? Thought it was pretty much dead, xmm better and doesn't interfere with FPU. I suppose you're just covering all the bases, as long as you're at it?

BTW you could use "movdqa xmm2, [ecx]" when u load the mask instead of unaligned movdqu, presumably a little faster.

Maybe pshufb will be more impressive on a "younger" i7.
I am NaN ;)

hutch--

It was mainly that I wanted to test if 2 x 64 bit ops were much slower than a single 128 bit op, and it's about 10%, which is not a big factor in the target I had in mind. "pshufb" is better suited to streaming data in 128 bit chunks, and doing things like BSWAP on 4 DWORDs really improves the speed. I used MOVDQU over the aligned version because on both my old Core2 quad and this i7, if the data was aligned there was barely any speed difference, and the unaligned version is more flexible. I keep in mind that an i7 is still a 64 bit processor so 128 bit ops are still done internally as 2 x 64 bit ops, but it seems the only gain with the 128 bit version is that it has a lower instruction count.
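To illustrate the "BSWAP on 4 DWORDs" case: with pshufb that is just a different mask, one that reverses the bytes inside each 4 byte group and leaves the groups where they are. The following C sketch uses a scalar model of the instruction (an assumption for illustration, not real SIMD code); `pshufb_model` and `bswap4_mask` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the 128 bit pshufb: select by the low 4 bits of each
   mask byte, or write zero when the mask byte has its top bit set. */
void pshufb_model(uint8_t dst[16], const uint8_t mask[16])
{
    uint8_t src[16];
    memcpy(src, dst, 16);
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Reverses the bytes within each DWORD; the DWORDs keep their positions. */
static const uint8_t bswap4_mask[16] = {
    0x03, 0x02, 0x01, 0x00,
    0x07, 0x06, 0x05, 0x04,
    0x0B, 0x0A, 0x09, 0x08,
    0x0F, 0x0E, 0x0D, 0x0C
};
```

One shuffle with this mask does the work of four separate BSWAP instructions plus their loads and stores, which is where the gain on streamed data would be expected to come from.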

sinsi

The version of ML that's with MASM32 won't assemble that.
Some results from different CPUs

Intel i7-4790 @3.6GHz
922 ms pshufb
1047 ms bswap
1015 ms mmxreverse

AMD A10-7850K @3.7GHz
1825 ms pshufb
1560 ms bswap
1904 ms mmxreverse

Intel Q9450 @2.66GHz
2437 ms pshufb
1860 ms bswap
4093 ms mmxreverse


You know it's easier to select someone's code with the "select" link above, right?

jj2007

i5:
1248 ms pshufb
1560 ms bswap
2480 ms mmxreverse

I could have sworn that we had a dedicated thread on reversing strings some months ago but can't find it :(

dedndave


jj2007

Quote from: dedndave on March 23, 2015, 09:51:25 PM
maybe you were thinking of reversing bits ?

No, it was bytes, just a few months ago. Here is what I used afterwards:

  sub edi, 4            ; ------------ revert string start ------------
  mov edx, edi
  add edx, f2sOverOne
  sub edx, esi            ; StrLen 3 and less?
  push edx
  jb @F
  align 4                  ; esi=start, edi=end-4
  .Repeat
      lodsd
      mov edx, [edi]
      bswap eax
      bswap edx
      mov [edi], eax
      mov [esi-4], edx
      sub edi, 4
  .Until edi<esi
@@:      add edi, 3
@@:      cmp edi, esi
  jb @F
  mov al, byte ptr [esi]
  mov dl, byte ptr [edi]
  mov [esi], dl
  mov [edi], al
  inc esi
  dec edi
  jmp @B                  ; ------------ revert string end ------------
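The general shape of that approach, sketched in C for readers who do not want to trace the registers: swap byte-reversed 4 byte chunks from both ends while the two chunks cannot overlap, then finish the remaining middle bytes one pair at a time. This is an independent sketch of the technique and not a translation of the snippet above; in particular the snippet's `f2sOverOne` constant and its exact loop conditions are not modeled, and `reverse_bytes` and `bswap32` are names invented here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32 bit value, like the BSWAP instruction. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reverse len bytes in place. */
void reverse_bytes(unsigned char *buf, size_t len)
{
    size_t lo = 0, hi = len;      /* unreversed region is [lo, hi) */

    /* 4 bytes from each end at a time, only while they cannot overlap */
    while (hi - lo >= 8) {
        uint32_t a, b;
        memcpy(&a, buf + lo, 4);
        memcpy(&b, buf + hi - 4, 4);
        a = bswap32(a);
        b = bswap32(b);
        memcpy(buf + lo, &b, 4);
        memcpy(buf + hi - 4, &a, 4);
        lo += 4;
        hi -= 4;
    }

    /* fewer than 8 bytes left in the middle: swap single bytes */
    while (hi - lo >= 2) {
        unsigned char t = buf[lo];
        buf[lo] = buf[hi - 1];
        buf[hi - 1] = t;
        lo++;
        hi--;
    }
}
```

For example, reversing the 11 bytes "abcdefghijk" takes one pass of the 4 byte loop and one single byte swap, giving "kjihgfedcba".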