Component test using bswap, xmm and mmx.

hutch-- · March 23, 2015, 02:38:34 PM

This is not a working algorithm but a test of a component that reverses 12 bytes, starting from the end of the string, so that the result is the right way around at the start of a buffer. I had in mind a particular type of DWORD to ASCII conversion. The BSWAP version reverses 12 bytes; the mmx and xmm versions reverse 16 bytes.

Note that pshufb, which both the mmx and xmm versions use, needs a CPU with SSSE3.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
Build this template with
"CONSOLE ASSEMBLE AND LINK"
----------------------------------------------------- *

include \masm32\include\masm32rt.inc
.686p
.mmx
.xmm

reverse12bytes PROTO :DWORD
reverse16bytes PROTO :DWORD

mmxreverse PROTO :DWORD

.data
; --------------------------
; item must be 16 bytes long
; --------------------------
item db 0,0,0,0,"lkfihgfedcba"
pitm dd item

; --------------------------
; item must be 12 bytes long
; --------------------------
item2 db 0,0,0,0,"87654321"
pitm2 dd item2

; --------------------------
; item must be 16 bytes long
; --------------------------
item3 db 0,0,0,0,"210987654321"
pitm3 dd item3

.code

start:

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

call main
inkey
exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

LOCAL tcnt :DWORD

push ebx
push esi
push edi

print "--------------------------------",13,10
print "performance test of 3 algorithms",13,10
print "--------------------------------",13,10

fn reverse16bytes,pitm
print pitm,13,10

fn reverse12bytes,pitm2
print pitm2,13,10

fn mmxreverse,pitm3
print pitm3,13,10

print "--------------------------------",13,10
print "benchmark of 3 algorithms",13,10
print "--------------------------------",13,10

; --------------------------

fn GetTickCount
push eax

mov esi, 500000000
; align 16
lbl2:
fn reverse16bytes,pitm
sub esi, 1
jnz lbl2

fn GetTickCount
pop ecx
sub eax, ecx

print ustr$(eax)," ms pshufb",13,10

; --------------------------

fn GetTickCount
push eax

mov esi, 500000000
; align 16
lbl1:
fn reverse12bytes,pitm2
sub esi, 1
jnz lbl1

fn GetTickCount
pop ecx
sub eax, ecx

print ustr$(eax)," ms bswap",13,10

; --------------------------

fn GetTickCount
push eax

mov esi, 500000000
; align 16
lbl3:
fn mmxreverse,pitm3
sub esi, 1
jnz lbl3

fn GetTickCount
pop ecx
sub eax, ecx

print ustr$(eax)," ms mmxreverse",13,10

; --------------------------

past:

pop edi
pop esi
pop ebx

ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

align 16

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

reverse12bytes proc pbuf:DWORD

mov edx, [esp+4]

; ------------------------------------------
; byte swap and exchange the two 4 byte ends
; ------------------------------------------
mov ecx, [edx+8]
bswap ecx

mov eax, [edx]
bswap eax

mov [edx+8], eax
mov [edx], ecx

; ------------------------------------------
; byte swap the middle 4 bytes
; ------------------------------------------
mov eax, [edx+4]
bswap eax
mov [edx+4], eax

ret 4

reverse12bytes endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

align 16

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

reverse16bytes proc ptritm:DWORD

.data
; ----------------------------------
; mask has 16 bytes in reverse order
; ----------------------------------
bmask00000000 \
db 0Fh,0Eh,0Dh,0Ch
db 0Bh,0Ah,09h,08h
db 07h,06h,05h,04h
db 03h,02h,01h,00h
pmsk00000000 dd bmask00000000
.code

mov eax, [esp+4] ; load source
movdqu xmm1, [eax]
mov ecx, pmsk00000000 ; load mask
movdqu xmm2, [ecx]

pshufb xmm1,xmm2 ; shuffle bytes according to mask order

movdqu [eax], xmm1 ; copy result back to memory

ret 4

reverse16bytes endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

align 16

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

mmxreverse proc pvalu:DWORD

.data
mask001 \
db 07h,06h,05h,04h
db 03h,02h,01h,00h
pmsk001 dd mask001
.code

mov eax, [esp+4]

mov ecx, pmsk001
movq mm1, [eax]
movq mm2, [eax+8]
movq mm3, [ecx]

pshufb mm1,mm3
pshufb mm2,mm3

movq [eax+8], mm1
movq [eax], mm2

ret 4

mmxreverse endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

These are typical results on my middle-aged i7.

--------------------------------
performance test of 3 algorithms
--------------------------------
abcdefghifkl
12345678
123456789012
--------------------------------
benchmark of 3 algorithms
--------------------------------
2059 ms pshufb
2231 ms bswap
2215 ms mmxreverse
Press any key to continue ...
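For anyone who wants to check the byte ordering without assembling the listing, here is a minimal C sketch of what the BSWAP component does. The names bswap32 and reverse12_model are made up for illustration and are not part of MASM32; the shift-and-mask helper only stands in for the BSWAP instruction.

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the x86 BSWAP instruction: reverse the 4 bytes of v.
   (Hypothetical helper, not a MASM32 or libc function.) */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reverse 12 bytes in place, following the steps of reverse12bytes:
   byte-swap the two outer dwords and exchange them, then byte-swap
   the middle dword where it sits. */
static void reverse12_model(unsigned char *buf)
{
    uint32_t lo, mid, hi;

    memcpy(&lo,  buf,     4);
    memcpy(&mid, buf + 4, 4);
    memcpy(&hi,  buf + 8, 4);

    lo  = bswap32(lo);
    mid = bswap32(mid);
    hi  = bswap32(hi);

    memcpy(buf,     &hi,  4);   /* old last dword becomes the first */
    memcpy(buf + 4, &mid, 4);   /* middle dword stays in place      */
    memcpy(buf + 8, &lo,  4);   /* old first dword becomes the last */
}
```

Fed the same 12 bytes as item2 (four zero bytes followed by "87654321"), the model leaves "12345678" at the start of the buffer with the zero bytes after it, which is why print can treat the result as a zero-terminated string.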

rrr314159 · March 23, 2015, 03:44:57 PM

Now I understand what pshufb does, thx, the Intel manual is confusing. Figured I'd see it used by somebody someday. It's not very impressive compared to old-fashioned moving the bytes around and bswapping: 2059 ms vs. 2231 ms. Of course that's 16 bytes vs. 12, but still, that's just a 44% or so improvement per byte. A lot of these new instructions are disappointing. Of course pshufb can shuffle other permutations with equal ease, but it requires 16 bytes for the mask. All in all, hardly seems like 10 years' worth of progress.
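The 44% figure appears to come from normalising each timing by the number of bytes reversed per call. A small C sketch of that arithmetic follows; the function names are invented for illustration, and the inputs are simply the timings quoted above.

```c
/* Elapsed time divided by the number of bytes each call reverses.
   The units are arbitrary because only the ratio matters. */
static double ms_per_byte(double elapsed_ms, int bytes_per_call)
{
    return elapsed_ms / bytes_per_call;
}

/* How much faster algorithm A is than algorithm B per byte, in percent. */
static double percent_faster(double a_ms, int a_bytes,
                             double b_ms, int b_bytes)
{
    return (ms_per_byte(b_ms, b_bytes) / ms_per_byte(a_ms, a_bytes) - 1.0)
           * 100.0;
}
```

Calling percent_faster(2059, 16, 2231, 12) compares pshufb (16 bytes per call) against bswap (12 bytes per call) and gives a value between 44 and 45.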

Do u use mmx? Thought it was pretty much dead, xmm better and doesn't interfere with FPU. I suppose you're just covering all the bases, as long as you're at it?

BTW you could use "movdqa xmm2, [ecx]" when u load the mask instead of unaligned movdqu, presumably a little faster.
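One caveat with that suggestion: movdqa faults if its memory operand is not on a 16-byte boundary, and the mask in the posted listing has no explicit alignment in its .data block, so the aligned load would need one. The C sketch below only illustrates the alignment requirement itself; rev_mask and is_aligned16 are hypothetical names.

```c
#include <stdint.h>

/* Byte-reversal mask forced onto a 16-byte boundary, so that an aligned
   128-bit load of it would be legal. */
static _Alignas(16) const uint8_t rev_mask[16] = {
    15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
};

/* Nonzero when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Here is_aligned16(rev_mask) holds, while is_aligned16(rev_mask + 1) does not; the second address is the kind that only the unaligned load can accept.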

Maybe pshufb will be more impressive on a "younger" i7.

hutch-- · March 23, 2015, 04:28:11 PM

It was mainly that I wanted to test if 2 x 64 bit ops were much slower than a single 128 bit op, and it's about 10%, which is not a big factor in the target I had in mind. "pshufb" is better suited to streaming data in 128 bit chunks, and doing things like BSWAP on 4 DWORDs really improves the speed. I used MOVDQU over the aligned version because on both my old Core2 quad and this i7, if the data was aligned there was barely any speed difference, and the unaligned version is more flexible. I keep in mind that an i7 is still a 64 bit processor, so 128 bit ops are still done internally as 2 x 64 bit ops, but it seems the only gain with the 128 bit version is that it has a lower instruction count.
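To make the "BSWAP on 4 DWORDs" idea concrete, here is a scalar C model of the 128 bit pshufb together with two masks: the full reversal used in the listing and a per-DWORD byte swap. It is a sketch of the instruction's documented behaviour, not SIMD code, and pshufb_model, mask_reverse16 and mask_bswap4 are names invented for the example.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of 128-bit PSHUFB: result byte i is src[mask[i] & 0x0F],
   or zero when bit 7 of mask[i] is set. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t mask[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    memcpy(dst, tmp, 16);   /* tmp lets dst and src be the same buffer */
}

/* Full 16-byte reversal: same byte values as bmask00000000 in the listing. */
static const uint8_t mask_reverse16[16] = {
    15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
};

/* BSWAP on four DWORDs at once: reverse the bytes inside each group of 4. */
static const uint8_t mask_bswap4[16] = {
    3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12
};
```

Applied to the 16 bytes "0123456789abcdef", mask_reverse16 gives "fedcba9876543210" and mask_bswap4 gives "32107654ba98fedc". Only the mask changes between the two, which is the flexibility mentioned earlier in the thread.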

sinsi · March 23, 2015, 07:17:01 PM

The version of ML that's with MASM32 won't assemble that.
Some results from different CPUs

Intel i7-4790 @3.6GHz
922 ms pshufb
1047 ms bswap
1015 ms mmxreverse

AMD A10-7850K @3.7GHz
1825 ms pshufb
1560 ms bswap
1904 ms mmxreverse

Intel Q9450 @2.66GHz
2437 ms pshufb
1860 ms bswap
4093 ms mmxreverse

You know it's easier to select someone's code with the "select" link above, right?

jj2007 · March 23, 2015, 09:34:58 PM

i5:
1248 ms pshufb
1560 ms bswap
2480 ms mmxreverse

I could have sworn that we had a dedicated thread on reversing strings some months ago but can't find it :(

dedndave · March 23, 2015, 09:51:25 PM

maybe you were thinking of reversing bits ?

http://www.masmforum.com/board/index.php?topic=12722.0

jj2007 · March 23, 2015, 10:09:00 PM

Quote from: dedndave on March 23, 2015, 09:51:25 PM
maybe you were thinking of reversing bits ?

No, it was bytes, just a few months ago. Here is what I used afterwards:

sub edi, 4 ; ------------ revert string start ------------
mov edx, edi
add edx, f2sOverOne
sub edx, esi ; StrLen 3 and less?
push edx
jb @F
align 4 ; esi=start, edi=end-4
.Repeat
lodsd
mov edx, [edi]
bswap eax
bswap edx
mov [edi], eax
mov [esi-4], edx
sub edi, 4
.Until edi<esi
@@: add edi, 3
@@: cmp edi, esi
jb @F
mov al, byte ptr [esi]
mov dl, byte ptr [edi]
mov [esi], dl
mov [edi], al
inc esi
dec edi
jmp @B ; ------------ revert string end ------------
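The fragment above can be read as: swap byte-reversed DWORDs from both ends while they fit, then finish the middle one byte at a time. Below is a loose C model of that reading. It leaves out the f2sOverOne length check and the surrounding routine, which the fragment does not show, and uses a plain n >= 4 test instead; reverse_bytes_model and bswap32 are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the BSWAP instruction (hypothetical helper). */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reverse n bytes of s in place. */
static void reverse_bytes_model(unsigned char *s, size_t n)
{
    ptrdiff_t i = 0;                  /* plays the role of esi (start)   */
    ptrdiff_t j = (ptrdiff_t)n - 4;   /* plays the role of edi (end - 4) */

    if (n >= 4) {                     /* assumption: stands in for the jb @F test */
        do {
            uint32_t a, b;
            memcpy(&a, s + i, 4);     /* front dword (lodsd)      */
            memcpy(&b, s + j, 4);     /* back dword (mov edx,[edi]) */
            a = bswap32(a);
            b = bswap32(b);
            memcpy(s + j, &a, 4);     /* reversed front dword goes to the back  */
            memcpy(s + i, &b, 4);     /* reversed back dword goes to the front  */
            i += 4;
            j -= 4;
        } while (j >= i);             /* .Until edi<esi */
    }

    j += 3;                           /* j now indexes the last unreversed byte */
    while (j >= i) {                  /* byte-wise tail for the 0..3 left over  */
        unsigned char t = s[i];
        s[i] = s[j];
        s[j] = t;
        i++;
        j--;
    }
}
```

When the two DWORDs overlap in the middle (for example at length 13), both stores write the same values to the shared bytes, so the result is still a correct reversal. For example, reversing the 13 bytes "abcdefghijklm" yields "mlkjihgfedcba".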
