This is not a working algorithm but a test of one component: reversing 12 bytes, starting from the end of the string, so that the result is the right way around at the start of a buffer. I had a particular type of DWORD to ASCII conversion in mind. The BSWAP version reverses 12 bytes; the MMX and XMM versions reverse 16 bytes.
Note that pshufb is an SSSE3 instruction, so it needs a processor with SSSE3 support to run.
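For anyone who reads C more easily than assembler, the three-DWORD BSWAP idea used by reverse12bytes in the listing that follows can be pictured roughly like this. It is only a sketch: the names `bswap32` and `reverse12` are invented for the illustration and are not part of the test program.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the BSWAP instruction: reverse the 4 bytes of v. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reverse exactly 12 bytes in place (hypothetical helper for illustration). */
static void reverse12(uint8_t buf[12])
{
    uint32_t first, middle, last;

    assert(buf != NULL);

    /* memcpy is used for the loads and stores so alignment does not matter. */
    memcpy(&first,  buf,     4);
    memcpy(&middle, buf + 4, 4);
    memcpy(&last,   buf + 8, 4);

    /* Byte swap each DWORD, then exchange the two outer DWORDs. */
    first  = bswap32(first);
    middle = bswap32(middle);
    last   = bswap32(last);

    memcpy(buf,     &last,   4);
    memcpy(buf + 4, &middle, 4);
    memcpy(buf + 8, &first,  4);
}
```

Fed the 12 bytes 0,0,0,0,"87654321", this leaves "12345678" followed by four zero bytes, the same layout the assembler version produces.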
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
comment * -----------------------------------------------------
Build this template with
"CONSOLE ASSEMBLE AND LINK"
----------------------------------------------------- *
include \masm32\include\masm32rt.inc
.686p
.mmx
.xmm
reverse12bytes PROTO :DWORD
reverse16bytes PROTO :DWORD
mmxreverse PROTO :DWORD
.data
; --------------------------
; item must be 16 bytes long
; --------------------------
item db 0,0,0,0,"lkfihgfedcba"
pitm dd item
; --------------------------
; item must be 12 bytes long
; --------------------------
item2 db 0,0,0,0,"87654321"
pitm2 dd item2
; --------------------------
; item must be 16 bytes long
; --------------------------
item3 db 0,0,0,0,"210987654321"
pitm3 dd item3
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL tcnt :DWORD
push ebx
push esi
push edi
print "--------------------------------",13,10
print "performance test of 3 algorithms",13,10
print "--------------------------------",13,10
fn reverse16bytes,pitm
print pitm,13,10
fn reverse12bytes,pitm2
print pitm2,13,10
fn mmxreverse,pitm3
print pitm3,13,10
print "--------------------------------",13,10
print "benchmark of 3 algorithms",13,10
print "--------------------------------",13,10
; --------------------------
fn GetTickCount
push eax
mov esi, 500000000
; align 16
lbl2:
fn reverse16bytes,pitm
sub esi, 1
jnz lbl2
fn GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," ms pshufb",13,10
; --------------------------
fn GetTickCount
push eax
mov esi, 500000000
; align 16
lbl1:
fn reverse12bytes,pitm2
sub esi, 1
jnz lbl1
fn GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," ms bswap",13,10
; --------------------------
fn GetTickCount
push eax
mov esi, 500000000
; align 16
lbl3:
fn mmxreverse,pitm3
sub esi, 1
jnz lbl3
fn GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," ms mmxreverse",13,10
; --------------------------
past:
pop edi
pop esi
pop ebx
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 16
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
reverse12bytes proc pbuf:DWORD
mov edx, [esp+4]
; ------------------------------------------
; byte swap and exchange the two 4 byte ends
; ------------------------------------------
mov ecx, [edx+8]
bswap ecx
mov eax, [edx]
bswap eax
mov [edx+8], eax
mov [edx], ecx
; ------------------------------------------
; byte swap the middle 4 bytes
; ------------------------------------------
mov eax, [edx+4]
bswap eax
mov [edx+4], eax
ret 4
reverse12bytes endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 16
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
reverse16bytes proc ptritm:DWORD
.data
; ----------------------------------
; mask has 16 bytes in reverse order
; ----------------------------------
bmask00000000 \
db 0Fh,0Eh,0Dh,0Ch
db 0Bh,0Ah,09h,08h
db 07h,06h,05h,04h
db 03h,02h,01h,00h
pmsk00000000 dd bmask00000000
.code
mov eax, [esp+4] ; load source
movdqu xmm1, [eax]
mov ecx, pmsk00000000 ; load mask
movdqu xmm2, [ecx]
pshufb xmm1,xmm2 ; shuffle bytes according to mask order
movdqu [eax], xmm1 ; copy result back to memory
ret 4
reverse16bytes endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 16
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
mmxreverse proc pvalu:DWORD
.data
mask001 \
db 07h,06h,05h,04h
db 03h,02h,01h,00h
pmsk001 dd mask001
.code
mov eax, [esp+4]
mov ecx, pmsk001
movq mm1, [eax]
movq mm2, [eax+8]
movq mm3, [ecx]
pshufb mm1,mm3
pshufb mm2,mm3
movq [eax+8], mm1
movq [eax], mm2
ret 4
mmxreverse endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
These are typical results on my middle aged i7.
--------------------------------
performance test of 3 algorithms
--------------------------------
abcdefghifkl
12345678
123456789012
--------------------------------
benchmark of 3 algorithms
--------------------------------
2059 ms pshufb
2231 ms bswap
2215 ms mmxreverse
Press any key to continue ...
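Since the three routines do not move the same number of bytes per call (12 for bswap, 16 for the other two), a per-byte figure makes the timings above easier to compare. A small C sketch of that arithmetic, using only the numbers printed above; the helper names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Each timing loop in the test program runs 500,000,000 calls. */
#define CALLS 500000000.0

/* Nanoseconds spent per byte reversed, given the total time in ms. */
static double ns_per_byte(double total_ms, int bytes_per_call)
{
    assert(bytes_per_call > 0);
    return total_ms * 1.0e6 / CALLS / bytes_per_call;
}

/* How much more per-byte throughput the faster routine has, in percent. */
static double gain_percent(double slow_ns, double fast_ns)
{
    assert(fast_ns > 0.0);
    return (slow_ns / fast_ns - 1.0) * 100.0;
}

/* Print the per-byte cost for the three timings listed above. */
static void print_comparison(void)
{
    double pshufb = ns_per_byte(2059.0, 16);
    double bswap  = ns_per_byte(2231.0, 12);
    double mmx    = ns_per_byte(2215.0, 16);

    printf("pshufb     %.3f ns/byte\n", pshufb);
    printf("bswap      %.3f ns/byte\n", bswap);
    printf("mmxreverse %.3f ns/byte\n", mmx);
    printf("pshufb vs bswap: %.1f%% more bytes per unit time\n",
           gain_percent(bswap, pshufb));
}
```

On these figures pshufb comes out at roughly 0.26 ns per byte against roughly 0.37 ns for bswap. The loop and call overhead is included in both, so the numbers say more about the whole call than about the instructions themselves.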
Now I understand what pshufb does, thanks; the Intel manual is confusing. I figured I'd see somebody use it someday. It's not very impressive compared to old-fashioned moving the bytes around and bswapping: 2059 ms vs. 2231 ms. Of course that's 16 bytes vs. 12, but still, that works out to only about a 44% improvement per byte. A lot of these new instructions are disappointing. Of course pshufb can shuffle any other permutation with equal ease, but it requires 16 bytes for the mask. All in all, it hardly seems like 10 years' worth of progress.
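Since the manual's description is hard going, here is a scalar C model of what the 128-bit form of pshufb computes: each result byte is picked from the source by the low four bits of the matching mask byte, or zeroed if the mask byte has its top bit set. `model_pshufb` and `reverse16` are illustrative names, not real intrinsics, and this is a sketch of the semantics rather than anything you would use for speed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of PSHUFB on a 16-byte register:
   dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F]            */
static void model_pshufb(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t mask[16])
{
    uint8_t tmp[16];
    int i;

    assert(dst != NULL && src != NULL && mask != NULL);

    for (i = 0; i < 16; i++) {
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    }
    /* The temporary lets dst and src be the same buffer, as with
       "pshufb xmm1, xmm2" where xmm1 is both source and destination. */
    memcpy(dst, tmp, 16);
}

/* Same byte values as the mask in reverse16bytes: result byte i is
   taken from source byte 15 - i.                                    */
static const uint8_t reverse_mask[16] = {
    0x0F, 0x0E, 0x0D, 0x0C, 0x0B, 0x0A, 0x09, 0x08,
    0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00
};

/* Reverse 16 bytes in place by shuffling with the reverse mask. */
static void reverse16(uint8_t buf[16])
{
    model_pshufb(buf, buf, reverse_mask);
}
```

Any other permutation only needs a different 16-byte mask; the shuffle routine itself stays the same.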
Do you use MMX? I thought it was pretty much dead; XMM is better and doesn't interfere with the FPU. I suppose you're just covering all the bases while you're at it?
BTW, you could use "movdqa xmm2, [ecx]" to load the mask instead of the unaligned movdqu (as long as the mask sits on a 16-byte boundary); presumably it's a little faster.
Maybe pshufb will be more impressive on a "younger" i7.
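On the movdqa suggestion: the aligned load faults if its memory operand is not on a 16-byte boundary, so the mask has to be placed accordingly. In assembler that means aligning the mask data; the C11 sketch below shows the same requirement and a way to check it. The names `aligned_mask` and `is_aligned16` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* C11: ask for the mask to be placed on a 16-byte boundary, which is
   what an aligned 128-bit load such as MOVDQA requires.             */
_Alignas(16) static const uint8_t reverse_mask[16] = {
    0x0F, 0x0E, 0x0D, 0x0C, 0x0B, 0x0A, 0x09, 0x08,
    0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00
};

/* Nonzero when p is a multiple of 16. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hand out the mask only after confirming it is safe for an aligned load. */
static const uint8_t *aligned_mask(void)
{
    assert(is_aligned16(reverse_mask));
    return reverse_mask;
}
```

An unaligned load (movdqu) has no such requirement, which is the flexibility being traded against the possible speed gain.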
It was mainly that I wanted to test whether 2 x 64 bit ops were much slower than a single 128 bit op. The difference is about 10%, which is not a big factor for the target I had in mind. "pshufb" is better suited to streaming data in 128 bit chunks, and doing things like BSWAP on 4 DWORDs at a time really improves the speed. I used MOVDQU rather than the aligned version because on both my old Core2 quad and this i7 there was barely any speed difference when the data was aligned, and the unaligned version is more flexible. I keep in mind that an i7 is still a 64 bit processor, so 128 bit ops are still done internally as 2 x 64 bit ops, but it seems the only gain with the 128 bit version is its lower instruction count.
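To make the "BSWAP on 4 DWORDs" remark concrete, here is a C sketch of the mask that would do it: each group of four mask bytes reverses the bytes of one DWORD, so a single shuffle does the work of four BSWAPs. The shuffle is modelled in scalar C under the name `model_pshufb`, and `bswap_four_dwords` is likewise invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the 128-bit PSHUFB (illustrative, not an intrinsic). */
static void model_pshufb(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t mask[16])
{
    uint8_t tmp[16];
    int i;

    assert(dst != NULL && src != NULL && mask != NULL);

    for (i = 0; i < 16; i++) {
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    }
    memcpy(dst, tmp, 16);   /* tmp allows an in-place shuffle */
}

/* Reverse the byte order inside each of the four DWORDs, leaving the
   DWORDs themselves where they are.                                  */
static const uint8_t bswap4_mask[16] = {
    0x03, 0x02, 0x01, 0x00,
    0x07, 0x06, 0x05, 0x04,
    0x0B, 0x0A, 0x09, 0x08,
    0x0F, 0x0E, 0x0D, 0x0C
};

/* Byte swap four packed DWORDs in place with one shuffle. */
static void bswap_four_dwords(uint8_t buf[16])
{
    model_pshufb(buf, buf, bswap4_mask);
}
```

Applied to the 16 bytes "ABCDEFGHIJKLMNOP" this gives "DCBAHGFELKJIPONM", which is the streaming case where one mask load is shared across many 16-byte chunks.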
The version of ML that's with MASM32 won't assemble that.
Some results from different CPUs
Intel i7-4790 @3.6GHz
922 ms pshufb
1047 ms bswap
1015 ms mmxreverse
AMD A10-7850K @3.7GHz
1825 ms pshufb
1560 ms bswap
1904 ms mmxreverse
Intel Q9450 @2.66GHz
2437 ms pshufb
1860 ms bswap
4093 ms mmxreverse
You know it's easier to select someone's code with the "select" link above, right?
i5:
1248 ms pshufb
1560 ms bswap
2480 ms mmxreverse
I could have sworn that we had a dedicated thread on reversing strings some months ago but can't find it :(
maybe you were thinking of reversing bits ?
http://www.masmforum.com/board/index.php?topic=12722.0
Quote from: dedndave on March 23, 2015, 09:51:25 PM
maybe you were thinking of reversing bits ?
No, it was bytes, just a few months ago. Here is what I used afterwards:
sub edi, 4 ; ------------ revert string start ------------
mov edx, edi
add edx, f2sOverOne
sub edx, esi ; StrLen 3 and less?
push edx
jb @F
align 4 ; esi=start, edi=end-4
.Repeat
lodsd
mov edx, [edi]
bswap eax
bswap edx
mov [edi], eax
mov [esi-4], edx
sub edi, 4
.Until edi<esi
@@: add edi, 3
@@: cmp edi, esi
jb @F
mov al, byte ptr [esi]
mov dl, byte ptr [edi]
mov [esi], dl
mov [edi], al
inc esi
dec edi
jmp @B ; ------------ revert string end ------------
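For comparison, the same general idea in C: swap byte-reversed DWORDs from both ends while there is room, then finish the middle one byte pair at a time. This is a sketch of the technique, not a line-by-line translation of the fragment above (it stops the DWORD loop earlier instead of letting the two ends meet), and `bswap32` and `reverse_bytes` are names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the BSWAP instruction. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reverse len bytes in place. */
static void reverse_bytes(uint8_t *buf, size_t len)
{
    size_t lo = 0;     /* first byte not yet placed at the front     */
    size_t hi = len;   /* one past the last byte not yet placed      */

    assert(buf != NULL || len == 0);

    /* While two non-overlapping DWORDs remain, byte swap both and
       exchange them.                                                */
    while (hi - lo >= 8) {
        uint32_t front, back;
        memcpy(&front, buf + lo, 4);
        memcpy(&back, buf + hi - 4, 4);
        front = bswap32(front);
        back = bswap32(back);
        memcpy(buf + lo, &back, 4);
        memcpy(buf + hi - 4, &front, 4);
        lo += 4;
        hi -= 4;
    }

    /* 0 to 7 bytes are left in the middle: swap them pairwise. */
    while (hi - lo >= 2) {
        uint8_t t = buf[lo];
        buf[lo] = buf[hi - 1];
        buf[hi - 1] = t;
        lo++;
        hi--;
    }
}
```

Unlike the fixed 12 and 16 byte routines, this handles any length, at the cost of the loop and the byte-wise tail.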