News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

String reverse algo

Started by hutch--, May 07, 2021, 09:34:39 PM

Previous topic - Next topic

hutch--

I don't get much time to write 32 bit code these days but since its prototype was wasted in the campus, here is a derivation of the first version that is self contained and full register design. Nothing to test it against but it runs OK.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    tRev PROTO pstr:DWORD

    .data
      numz db "1234567890",0

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL rval :DWORD

    invoke tRev,ADDR numz       ; reverse the number above
    mov rval, eax

    print rval,13,10

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

tRev proc pstr:DWORD
  ; ---------------------------------------------
  ; in place string reverse, fully self contained
  ; ---------------------------------------------
    push ebx                            ; save EBX
    push edi
    push esi

    mov edi, [ebp+8]
    mov eax, edi
    sub eax, 1

  @@:                                   ; unroll by 4
  REPEAT 3
    add eax, 1                          ; get length
    cmp BYTE PTR [eax], 0               ; loop to buffer end
    je @F
  ENDM
    add eax, 1                          ; get length
    cmp BYTE PTR [eax], 0               ; loop to buffer end
    jne @B

  @@:
    sub eax, edi                        ; sub address from eax for len
    mov esi, eax                        ; store length in esi
    shr eax, 1                          ; get integer half length of string
    mov ebx, eax                        ; store it in variable
    mov eax, edi                        ; eax has start char
    mov edx, edi                        ; load start address
    add edx, esi                        ; get end character address
    sub edx, 1                          ; sub 1 to get end char
    mov esi, ebx                        ; copy half len into edi

  @@:
    movzx ebx, BYTE PTR [eax]           ; load start char
    movzx ecx, BYTE PTR [edx]           ; load end char
    mov [eax], cl                       ; reverse char pair
    add eax, 1                          ; increment start char
    mov [edx], bl
    sub edx, 1                          ; decrement end char
    sub esi, 1                          ; sub 1 from half length
    jnz @B                              ; loop back if its not zero

    mov eax, edi                        ; return the address in eax

    pop esi
    pop edi
    pop ebx                             ; restore EBX

    ret 4                               ; balance the stack by 4 bytes

tRev endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

jj2007

Since we are in the Lab :biggrin:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

25756   cycles for 100 * tRev (Hutch)
33742   cycles for 100 * strrev (CRT)
13687   cycles for 100 * jRev (jj)

24425   cycles for 100 * tRev (Hutch)
33634   cycles for 100 * strrev (CRT)
13737   cycles for 100 * jRev (jj)

24779   cycles for 100 * tRev (Hutch)
33666   cycles for 100 * strrev (CRT)
13671   cycles for 100 * jRev (jj)

24694   cycles for 100 * tRev (Hutch)
33700   cycles for 100 * strrev (CRT)
13758   cycles for 100 * jRev (jj)

107     bytes for tRev (Hutch)
19      bytes for strrev (CRT)
51      bytes for jRev (jj)

1819043144      = eax tRev (Hutch)
1819043144      = eax strrev (CRT)
1819043144      = eax jRev (jj)

FORTRANS

Hi,

   Two results.

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

25531 cycles for 100 * tRev (Hutch)
38912 cycles for 100 * strrev (CRT)
14510 cycles for 100 * jRev (jj)

25745 cycles for 100 * tRev (Hutch)
38790 cycles for 100 * strrev (CRT)
14539 cycles for 100 * jRev (jj)

25463 cycles for 100 * tRev (Hutch)
39089 cycles for 100 * strrev (CRT)
14285 cycles for 100 * jRev (jj)

25515 cycles for 100 * tRev (Hutch)
38875 cycles for 100 * strrev (CRT)
14330 cycles for 100 * jRev (jj)

107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
51 bytes for jRev (jj)

1819043144 = eax tRev (Hutch)
1819043144 = eax strrev (CRT)
1819043144 = eax jRev (jj)

--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

37301 cycles for 100 * tRev (Hutch)
65684 cycles for 100 * strrev (CRT)
24447 cycles for 100 * jRev (jj)

37206 cycles for 100 * tRev (Hutch)
65876 cycles for 100 * strrev (CRT)
24418 cycles for 100 * jRev (jj)

37151 cycles for 100 * tRev (Hutch)
65790 cycles for 100 * strrev (CRT)
24463 cycles for 100 * jRev (jj)

37155 cycles for 100 * tRev (Hutch)
65919 cycles for 100 * strrev (CRT)
24468 cycles for 100 * jRev (jj)

107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
51 bytes for jRev (jj)

1819043144 = eax tRev (Hutch)
1819043144 = eax strrev (CRT)
1819043144 = eax jRev (jj)

--- ok ---


Cheers,

Steve N.

jj2007

#3
Thanks, Steve. Here is another one called xRev, using pshufb:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

24424   cycles for 100 * tRev (Hutch)
33751   cycles for 100 * strrev (CRT)
13704   cycles for 100 * jRev (jj)
14602   cycles for 100 * Mirror$ (MasmBasic)
4662    cycles for 100 * xRev (jj)

24513   cycles for 100 * tRev (Hutch)
33951   cycles for 100 * strrev (CRT)
13689   cycles for 100 * jRev (jj)
14589   cycles for 100 * Mirror$ (MasmBasic)
4659    cycles for 100 * xRev (jj)

24555   cycles for 100 * tRev (Hutch)
33987   cycles for 100 * strrev (CRT)
13773   cycles for 100 * jRev (jj)
14670   cycles for 100 * Mirror$ (MasmBasic)
4677    cycles for 100 * xRev (jj)

24750   cycles for 100 * tRev (Hutch)
33709   cycles for 100 * strrev (CRT)
13749   cycles for 100 * jRev (jj)
14620   cycles for 100 * Mirror$ (MasmBasic)
4721    cycles for 100 * xRev (jj)

107     bytes for tRev (Hutch)
19      bytes for strrev (CRT)
51      bytes for jRev (jj)
12      bytes for Mirror$ (MasmBasic)
87      bytes for xRev (jj)

1819043144      = eax tRev (Hutch)
1819043144      = eax strrev (CRT)
1819043144      = eax jRev (jj)
2053468783      = eax Mirror$ (MasmBasic)
1819043144      = eax xRev (jj)


Note Mirror$() and Caballero's algo write to a buffer, i.e. they are not in-place algos.

Do not use for serious stuff - this is work in progress and not fully tested :cool:

P.S.: I've added Lingo's algo (fast but crashes for unaligned strings), and Caballero's contribution.

daydreamer

wouldnt it be more standard today with unicode 16 reverse string?

my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

hutch--

Write one for us Magnus.  :thup:

FORTRANS

Hi,

   One result and a partial.

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

25930 cycles for 100 * tRev (Hutch)
39465 cycles for 100 * strrev (CRT)
14214 cycles for 100 * jRev (jj)
17288 cycles for 100 * Mirror$ (MasmBasic)
5350 cycles for 100 * xRev (jj, needs pshufb)
6834 cycles for 100 * RevLingo (aligned strings only)
113656 cycles for 100 * cRev (caballero)

26047 cycles for 100 * tRev (Hutch)
40261 cycles for 100 * strrev (CRT)
14364 cycles for 100 * jRev (jj)
18570 cycles for 100 * Mirror$ (MasmBasic)
5377 cycles for 100 * xRev (jj, needs pshufb)
6637 cycles for 100 * RevLingo (aligned strings only)
114863 cycles for 100 * cRev (caballero)

26578 cycles for 100 * tRev (Hutch)
39065 cycles for 100 * strrev (CRT)
14363 cycles for 100 * jRev (jj)
16243 cycles for 100 * Mirror$ (MasmBasic)
5453 cycles for 100 * xRev (jj, needs pshufb)
6650 cycles for 100 * RevLingo (aligned strings only)
115443 cycles for 100 * cRev (caballero)

26088 cycles for 100 * tRev (Hutch)
39862 cycles for 100 * strrev (CRT)
14335 cycles for 100 * jRev (jj)
16200 cycles for 100 * Mirror$ (MasmBasic)
5371 cycles for 100 * xRev (jj, needs pshufb)
6628 cycles for 100 * RevLingo (aligned strings only)
115281 cycles for 100 * cRev (caballero)

107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
51 bytes for jRev (jj)
10 bytes for Mirror$ (MasmBasic)
87 bytes for xRev (jj, needs pshufb)
123 bytes for RevLingo (aligned strings only)
56 bytes for cRev (caballero)

1259388704 = eax tRev (Hutch)
1259388704 = eax strrev (CRT)
1259388704 = eax jRev (jj)
1259388704 = eax Mirror$ (MasmBasic)
1259388704 = eax xRev (jj, needs pshufb)
1259388704 = eax RevLingo (aligned strings only)
1259388704 = eax cRev (caballero)

--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

37197 cycles for 100 * tRev (Hutch)
65773 cycles for 100 * strrev (CRT)
24460 cycles for 100 * jRev (jj)
28888 cycles for 100 * Mirror$ (MasmBasic)


Regards,

Steve N.

jj2007

Thanks, Steve. Apparently your Pentium M doesn't feature pshufb...

avcaballero

Hi, I wonder why is my version soooo slow. As I see, I think the problem is the use of "lstrlen" each time cRev proc is called. If we know that the string is always 100 chars long, we could change the code as follows. It should be faster now.

cRev proc uses esi edi pSrc, pDest
  mov esi, pSrc
  ;invoke lstrlen, esi
  ;mov ecx, eax
  mov ecx, 100
  mov edi, pDest
  add edi, ecx
  mov byte ptr [edi], 0
  dec edi
  @Bucle:
  movsb
  sub edi, 2
  loop @Bucle
  ret
cRev endp


jj2007

The new version is indeed a bit faster. However, all other algos check for the length of the string.

cRev proc uses esi edi pSrc, pDest
  mov esi, pSrc
  if 0
mov ecx, 100 ; 75800 cycles for 100 * cRev (caballero)
  else
invoke lstrlen, esi ; 96500 cycles
mov ecx, eax
  endif

hutch--

I don't know the range of designs but any string reverse should be able to handle variable length strings. To be fair to all techniques, ranging from 2 or 3 characters to a couple of hundred characters would give you more useful results.

avcaballero

I have done some testings. It seems to penalize a lot using movs and even loop. Finally the best of all is using dwords as much as possible. All of them use the str length.

It was a bit hard to me compile jj's code, so I created my own to test it.

These are the results:

Test Algorithm 7:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 6:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 5:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 4:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 3:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 2:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 1:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Timming Algorithm 7: 94
Test Timming Algorithm 6: 78
Test Timming Algorithm 5:  109
Test Timming Algorithm 4:  219
Test Timming Algorithm 3:  203
Test Timming Algorithm 2:  313
Test Timming Algorithm 1:  312


algo 1: lstrlen, movsb, loop ecx
algo 2: Longitud (my own subroutine to calc str_len), movsb, loop ecx
algo 3: lstrlen, inc/dec, loop ecx
algo 4: Longitud, inc/dec, loop ecx
algo 5: Longitud, inc/dec, dec ecx
algo 6: lstrlen, dwords, dec ecx
algo 7: Longitud, dwords, dec ecx


hutch--

caballero,

Try something like this.

    mov eax, pTxt
    sub eax, 1
  @@:
    add eax, 1
    cmp BYTE PTR [eax], 0
    jne @B
    sub eax, pTxt

Length is in eax.

avcaballero

@Hutch
Yes, that's faster  :thumbsup:. It definitely doesn't seem like a good idea to use movs niether scas

nidud

#14
deleted