Libraries vs Win API

Started by Manos, May 07, 2013, 01:24:32 AM


Manos

 :idea:
Because my previous thread, "C/C++ vs Assembler", became very long and drifted off topic, I am starting a new one.

I never believed that the Win API was slower than the CRT or other libraries.
My previous thread gave me the chance to verify that.
Therefore, I decided to test other functions as well.

I wrote a simple function named strCopy.
This is the code:

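strCopy proc pDest:DWORD, pSource:DWORD
    mov ecx, pSource            ; ecx -> source string
    mov edx, pDest              ; edx -> destination buffer
  copyLoop:
    mov al, byte ptr [ecx]      ; load one byte
    inc ecx
    mov byte ptr [edx], al      ; store it
    inc edx
    cmp al, 0                   ; stop once the terminating zero has been copied
    jnz copyLoop
    mov al, byte ptr 0          ; AL is already 0 when the loop exits
    ret
strCopy endp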


The function above takes advantage of the processor's two pipelines.

I tested it against szCopy from Hutch's library, the lstrcpy Win API function and the CRT strcpy.
My source and the averaged results follow:

.data

szText      db "abcdefghijklmnopqrstuvwxyz123456789", 0
szBuffer    db 64 dup (0)           ; 64 zero bytes (the count comes before dup)

includelib \MSVCRT.LIB

strcpy  PROTO C :DWORD, :DWORD
strCopy PROTO pDest:DWORD, pSource:DWORD

;......................................
LOCAL dwTime   :DWORD

    invoke GetTickCount
    mov dwTime, eax
    push esi
    xor esi, esi
  TestLoop:
    invoke lstrcpy, addr szBuffer, addr szText
  ; invoke strcpy, addr szBuffer, addr szText
  ; invoke szCopy, addr szText, addr szBuffer
  ; invoke strCopy, addr szBuffer, addr szText
    inc esi
    cmp esi, 10000000
    jb TestLoop

    pop esi

    invoke GetTickCount
    sub eax, dwTime
    PrintDec eax


Results (average):

lstrcpy (API)         570 ticks
szCopy  (Hutch lib)   515 ticks
strCopy (mine)        359 ticks
strcpy  (crt)         172 ticks
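
For comparison, the same kind of measurement loop can be sketched in C. This is only an illustration of the method used above (call the routine a fixed number of times and take the elapsed time), not the code that produced the numbers. The names copy_fn, crt_copy and time_copy_ms are made up for this sketch, and clock() stands in for GetTickCount.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Signature shared by the routines under test: destination first, like strcpy.
   (Hypothetical typedef, introduced only for this sketch.) */
typedef char *(*copy_fn)(char *dst, const char *src);

/* Thin wrapper so the CRT strcpy can be passed as a copy_fn. */
char *crt_copy(char *dst, const char *src)
{
    return strcpy(dst, src);
}

/* Call fn `iterations` times and return the elapsed CPU time in milliseconds. */
double time_copy_ms(copy_fn fn, char *dst, const char *src, long iterations)
{
    assert(fn != NULL && dst != NULL && src != NULL);
    clock_t start = clock();
    for (long i = 0; i < iterations; ++i) {
        fn(dst, src);
    }
    clock_t stop = clock();
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
}
```

A call such as time_copy_ms(crt_copy, buffer, text, 10000000) mirrors the loop above. An optimising C compiler may hoist or drop repeated copies, so figures from such a loop are only comparable when the builds use the same settings.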

Manos.









jj2007

Nothing beats rep movsd. If you need inspiration, try this and this thread.

Or, even better, the Code location sensitivity of timings thread, but beware, it's advanced stuff. Unfortunately the thread no longer has the attachments, but here they are below.
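
A note on what a rep movsd copy implies: the instruction copies a known count of dwords, so a string copy built on it has to find the length first. In C terms that is roughly a strlen followed by a memcpy. The sketch below uses only standard library calls; the name copy_known_length is made up here, and whether memcpy ends up as rep movs depends on the compiler and CRT.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: measure the string first, then do one block copy. */
char *copy_known_length(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t n = strlen(src) + 1;   /* include the terminating zero */
    memcpy(dst, src, n);          /* block copy of n bytes */
    return dst;
}
```

The cost is two passes over the source, which is why this approach tends to pay off on longer strings rather than on short ones.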

RuiLoureiro

Quote from: jj2007 on May 07, 2013, 02:03:03 AM
Nothing beats rep movsd.
Jochen, I don't get that on my P4  :icon14:
Well, I tested rep movsb.

jj2007

Quote from: RuiLoureiro on May 07, 2013, 02:30:03 AM
Jochen, I don't get that on my P4  :icon14:

No 5, MemCoP4 is best for you ;-)

hutch--

Manos,

Here is a quick optimisation for your original algo, 1 less instruction per iteration and unrolled by 2.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    StrCpy2 PROTO src:DWORD,dst:DWORD
    strCopy PROTO pDest:DWORD, pSource:DWORD

    .data
    align 4
      item db "The game is done, I've won I've won quote she and whistled thrice",0
    align 4
      buff db "                                                                     "

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    push esi

  REPEAT 8

    invoke GetTickCount
    push eax

    mov esi, 10000000
  @@:
    invoke strCopy, ADDR buff,ADDR item
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," Original",13,10

    invoke GetTickCount
    push eax

    mov esi, 10000000
  @@:
    invoke StrCpy2, ADDR item,ADDR buff
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," Modified",13,10

  ENDM

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 4

strCopy proc pDest:DWORD, pSource:DWORD
    mov ecx, pSource
    mov edx, pDest
  copyLoop:
    mov al, byte ptr [ecx]
    inc ecx
    mov byte ptr [edx], al
    inc edx
    cmp al, 0
    jnz copyLoop
    mov al, byte ptr 0
    ret
strCopy endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

StrCpy2 proc src:DWORD,dst:DWORD

    mov ecx, [esp+4]    ; src
    mov edx, [esp+8]    ; dst
    push esi
    mov esi, -1

  @@:
    add esi, 1
    movzx eax, BYTE PTR [ecx+esi]
    mov BYTE PTR [edx+esi], al
    test eax, eax
    jz @F

    add esi, 1
    movzx eax, BYTE PTR [ecx+esi]
    mov BYTE PTR [edx+esi], al
    test eax, eax
    jnz @B

  @@:

    pop esi
    ret 8

StrCpy2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start


Timing on my Core2 Quad.


531 Original
344 Modified
531 Original
344 Modified
516 Original
359 Modified
516 Original
343 Modified
532 Original
343 Modified
532 Original
343 Modified
516 Original
359 Modified
516 Original
344 Modified
Press any key to continue ...

Vortex

Pentium IV, 3.2 GHz:

1000 Original
844 Modified
984 Original
844 Modified
984 Original
844 Modified
969 Original
859 Modified
969 Original
843 Modified
985 Original
844 Modified
984 Original
844 Modified
984 Original
844 Modified
Press any key to continue ...

Manos

Steve,

your function is much faster than mine.

Results (average):

strCopy (mine)   360 ticks
StrCpy2 (yours)  270 ticks


This is because you replaced the two pointer increments with a single index increment
and wrote the loop body twice in the function.
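
The difference between the two loop shapes can be sketched in C. This is an illustration only; the function names are made up here, and a C compiler is free to generate different instructions than the hand-written MASM versions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two pointers, two increments per byte (the shape of strCopy). */
void copy_two_pointers(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    char c;
    do {
        c = *src++;
        *dst++ = c;
    } while (c != '\0');
}

/* One shared index, loop body written twice (the shape of StrCpy2). */
void copy_indexed_unrolled2(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (;;) {
        if ((dst[i] = src[i]) == '\0') break;   /* first copy of the body */
        ++i;
        if ((dst[i] = src[i]) == '\0') break;   /* second copy of the body */
        ++i;
    }
}
```

Both functions copy the terminating zero and stop, so they behave the same for strings of odd and even length.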

I wrote a new one, named strCopyNew, using the same trick,
and its results are identical to those of your function.

strCopyNew proc pDest:DWORD, pSource:DWORD
    mov ecx, pSource
    mov edx, pDest
    push esi
    mov esi, -1
  @@:
    add esi, 1
    movzx eax, byte ptr [ecx + esi]
    mov byte ptr [edx + esi], al
    test eax, eax
    jz @F

    add esi, 1
    movzx eax, byte ptr [ecx + esi]
    mov byte ptr [edx + esi], al
    test eax, eax
    jnz @B

  @@:
    pop esi
    ret
strCopyNew endp


Results (average):
strCopyNew 270 ticks

Manos.

RuiLoureiro

Quote from: hutch-- on May 08, 2013, 10:53:51 PM
Manos,
Here is a quick optimisation for your original algo, 1 less instruction per iteration and unrolled by 2.
Hutch, here is another optimisation: 1 less instruction per iteration.

Results on my P4
Quote
1016 Original
844 Modified
781 Modified again
969 Original
844 Modified
796 Modified again
985 Original
844 Modified
765 Modified again
1000 Original
844 Modified
781 Modified again
985 Original
859 Modified
781 Modified again
1000 Original
828 Modified
782 Modified again
984 Original
859 Modified
766 Modified again
984 Original
844 Modified
781 Modified again
Press any key to continue ...


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    StrCpy3 PROTO src:DWORD,dst:DWORD
    StrCpy2 PROTO src:DWORD,dst:DWORD
    strCopy PROTO pDest:DWORD, pSource:DWORD

    .data
    align 4
      item db "The game is done, I've won I've won quote she and whistled thrice",0
    align 4
      buff db "                                                                     "

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    push esi

  REPEAT 8

    invoke GetTickCount
    push eax

    mov esi, 10000000
  @@:
    invoke strCopy, ADDR buff,ADDR item
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," Original",13,10
;----------------------------------------
    invoke GetTickCount
    push eax
    mov esi, 10000000
  @@:
    invoke StrCpy2, ADDR item,ADDR buff
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," Modified",13,10

;----------------------------------------
    invoke GetTickCount
    push eax
    mov esi, 10000000
  @@:
    invoke StrCpy3, ADDR item,ADDR buff
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," Modified again",13,10

  ENDM

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 4

strCopy proc pDest:DWORD, pSource:DWORD
    mov ecx, pSource
    mov edx, pDest
  copyLoop:
    mov al, byte ptr [ecx]
    inc ecx
    mov byte ptr [edx], al
    inc edx
    cmp al, 0
    jnz copyLoop
    mov al, byte ptr 0
    ret
strCopy endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

StrCpy2 proc src:DWORD,dst:DWORD

    mov ecx, [esp+4]    ; src
    mov edx, [esp+8]    ; dst
    push esi
    mov esi, -1

  @@:
    add esi, 1
    movzx eax, BYTE PTR [ecx+esi]
    mov BYTE PTR [edx+esi], al
    test eax, eax
    jz @F

    add esi, 1
    movzx eax, BYTE PTR [ecx+esi]
    mov BYTE PTR [edx+esi], al
    test eax, eax
    jnz @B

  @@:

    pop esi
    ret 8

StrCpy2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

StrCpy3 proc src:DWORD,dst:DWORD

    mov ecx, [esp+4]    ; src
    mov edx, [esp+8]    ; dst
    push esi
    mov esi, -1

  @@:
    add esi, 1
    movzx eax, WORD PTR [ecx+esi]
    mov BYTE PTR [edx+esi], al
    or   al, al
    jz @F

    add esi, 1
    ;movzx eax, BYTE PTR [ecx+esi]
    mov BYTE PTR [edx+esi], ah
    or  ah, ah
    jnz @B

  @@:

    pop esi
    ret 8

StrCpy3 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
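
One property of the 16-bit load in StrCpy3 is worth spelling out: when the terminating zero sits in the low byte, the high byte comes from one position past the end of the string. The C sketch below shows the same access pattern under that stated assumption. The name copy_word_loads is made up here, and memcpy stands in for the movzx load so the C stays well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies a zero-terminated string using one 16-bit load per two bytes.
   Assumption: src has at least one readable byte after its terminator,
   because the 16-bit load can reach one byte past it. */
void copy_word_loads(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (;;) {
        unsigned char pair[2];
        memcpy(pair, src + i, 2);      /* stands in for the 16-bit load */
        dst[i] = (char)pair[0];        /* low byte first */
        if (pair[0] == 0) break;
        dst[i + 1] = (char)pair[1];    /* then the high byte */
        if (pair[1] == 0) break;
        i += 2;
    }
}
```

With the test data in this thread the extra byte is always inside the data section, so the over-read is harmless there; it could matter for a string that ends on the last byte of a readable page.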

RuiLoureiro

Here is another optimisation:

Quote
1015 Original
829 Modified
750 Modified again
671 Modified again 2
954 Original
828 Modified
750 Modified again
672 Modified again 2
953 Original
812 Modified
766 Modified again
703 Modified again 2
937 Original
829 Modified
750 Modified again
703 Modified again 2
937 Original
813 Modified
750 Modified again
672 Modified again 2
937 Original
828 Modified
750 Modified again
656 Modified again 2
938 Original
812 Modified
750 Modified again
656 Modified again 2
969 Original
813 Modified
765 Modified again
656 Modified again 2
Press any key to continue ...


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    StrCpy4 PROTO src:DWORD,dst:DWORD
    StrCpy3 PROTO src:DWORD,dst:DWORD
    StrCpy2 PROTO src:DWORD,dst:DWORD
    strCopy PROTO pDest:DWORD, pSource:DWORD

    .data
    align 4
      item db "The game is done, I've won I've won quote she and whistled thrice",0
    align 4
      buff db "                                                                     "

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    push esi

    invoke StrCpy3, ADDR item,ADDR buff
    print   addr buff,13,10
    invoke StrCpy4, ADDR item,ADDR buff
    print   addr buff,13,10


  REPEAT 8

    invoke GetTickCount
    push eax

    mov esi, 10000000
  @@:
    invoke strCopy, ADDR buff,ADDR item
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," Original",13,10
;----------------------------------------
    invoke GetTickCount
    push eax
    mov esi, 10000000
  @@:
    invoke StrCpy2, ADDR item,ADDR buff
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," Modified",13,10

;----------------------------------------
    invoke GetTickCount
    push eax
    mov esi, 10000000
  @@:
    invoke StrCpy3, ADDR item,ADDR buff
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," Modified again",13,10

;----------------------------------------
    invoke GetTickCount
    push eax
    mov esi, 10000000
  @@:
    invoke StrCpy4, ADDR item,ADDR buff
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print ustr$(eax)," Modified again 2",13,10

  ENDM

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 4

strCopy proc pDest:DWORD, pSource:DWORD
    mov ecx, pSource
    mov edx, pDest
  copyLoop:
    mov al, byte ptr [ecx]
    inc ecx
    mov byte ptr [edx], al
    inc edx
    cmp al, 0
    jnz copyLoop
    mov al, byte ptr 0
    ret
strCopy endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

StrCpy2 proc src:DWORD,dst:DWORD

    mov ecx, [esp+4]    ; src
    mov edx, [esp+8]    ; dst
    push esi
    mov esi, -1

  @@:
    add esi, 1
    movzx eax, BYTE PTR [ecx+esi]
    mov BYTE PTR [edx+esi], al
    test eax, eax
    jz @F

    add esi, 1
    movzx eax, BYTE PTR [ecx+esi]
    mov BYTE PTR [edx+esi], al
    test eax, eax
    jnz @B

  @@:

    pop esi
    ret 8

StrCpy2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

StrCpy3 proc src:DWORD,dst:DWORD

    mov ecx, [esp+4]    ; src
    mov edx, [esp+8]    ; dst
    push esi
    mov esi, -1

  @@:
    add esi, 1
    movzx eax, WORD PTR [ecx+esi]
    mov BYTE PTR [edx+esi], al
    or   al, al
    jz @F

    add esi, 1
    ;movzx eax, BYTE PTR [ecx+esi]
    mov BYTE PTR [edx+esi], ah
    or  ah, ah
    jnz @B

  @@:

    pop esi
    ret 8

StrCpy3 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

StrCpy4 proc src:DWORD,dst:DWORD
    mov ecx, [esp+4]    ; src
    mov edx, [esp+8]    ; dst
    push esi
    xor     esi, esi
  @@:
    movzx eax, WORD PTR [ecx+esi]
    mov BYTE PTR [edx+esi], al
    or   al, al
    jz @F

    mov BYTE PTR [edx+esi+1], ah
    add esi, 2

    or  ah, ah
    jnz @B

  @@:

    pop esi
    ret 8

StrCpy4 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

StrCpy5 proc src:DWORD,dst:DWORD

    mov ecx, [esp+4]    ; src
    mov edx, [esp+8]    ; dst
    push esi
    mov esi, -1

  @@:
    add esi, 1
    mov eax, [ecx+esi]
    mov BYTE PTR [edx+esi], al
    or   al, al
    jz   @F

    add esi, 1
    mov BYTE PTR [edx+esi], ah
    or  ah, ah
    jz   @F

    shr eax, 16
    add esi, 1
    mov BYTE PTR [edx+esi], al
    or  al, al
    jz  @F

    add esi, 1
    mov BYTE PTR [edx+esi], ah
    or  ah, ah
    jnz @B

  @@:

    pop esi
    ret 8

StrCpy5 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

jj2007

Celeron M:
1016 Original
891 Modified
890 Modified again
891 Modified again 2
890 Modified again J
672 MasmBasic copy


"Modified again J" is a dword load plus bswap:
  @@:
    inc esi
    mov eax, [ecx+esi]
    mov BYTE PTR [edx+esi], al
    test al, al
    jz @F

    inc esi
    mov BYTE PTR [edx+esi], ah
    test ah, ah
    jz @F
   
    bswap eax
    inc esi
    mov BYTE PTR [edx+esi], ah
    test ah, ah
    jz @F

    inc esi
    mov BYTE PTR [edx+esi], al
    test al, al
    jnz @B


test al, al instead of or al, al is a lot faster on my CPU.
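
The dword-load idea can be sketched in C as well. Standard C has no byte-swap operator, so this sketch swaps in plain shifts to reach the upper bytes; it shows the access pattern, not the bswap instruction itself. It assumes little-endian byte order (as on x86) and a source buffer that is readable up to three bytes past the terminator. The name copy_dword_loads is made up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies a zero-terminated string with one 32-bit load per four bytes.
   Assumptions: little-endian byte order, and src readable up to three
   bytes past its terminator. */
void copy_dword_loads(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (;;) {
        uint32_t v;
        memcpy(&v, src + i, sizeof v);      /* stands in for the dword load */
        for (int k = 0; k < 4; ++k) {
            /* On a little-endian machine, byte k in memory order. */
            unsigned char b = (unsigned char)(v >> (8 * k));
            dst[i + k] = (char)b;
            if (b == 0) return;             /* terminator copied, done */
        }
        i += 4;
    }
}
```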

RuiLoureiro

Jochen,
             or al, al is faster on my P4.
             I tried StrCpy5, but it was not faster! But you came up with bswap! ;)

Manos

I tested Hutch's new function again, together with the CRT function and the API.

.data

szText      db "abcdefghijklmnopqrstuvwxyz123456789", 0
szBuffer    db 64 dup (0)

includelib \MSVCRT.LIB

StrCpy2 PROTO src:DWORD,dst:DWORD
strcpy  PROTO C :DWORD, :DWORD

Results:
lstrcpy (API)     563
StrCpy2 (Hutch)   297
strcpy  (crt)     172


Manos.




qWord

I'm curious whether anyone is interested in a more statistical approach to measurement, where the timing is taken for a single call to the code under test and several milliseconds pass between calls, so that the cache is not involved as it is with the loop-x-thousand-times method. Especially for memory-intensive functions like MemCopy or table-based methods, this might be a better way to measure speed...
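
A rough C sketch of that idea: time one call per sample, pause between calls, and summarise the samples with the median instead of timing one long loop. Everything named here (pause_fn, copy_fn, cmp_ll, median_ll, time_single_calls_ns) is made up for the sketch. The pause is passed in by the caller, for example a wrapper around Sleep, and timespec_get is the C11 wall-clock call, whose real resolution depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void (*pause_fn)(void);     /* hypothetical: e.g. wraps a short sleep */
typedef char *(*copy_fn)(char *dst, const char *src);

/* Comparison callback for qsort over long long values. */
int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a, y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and returns their median. */
long long median_ll(long long *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_ll);
    return (n % 2) ? samples[n / 2]
                   : (samples[n / 2 - 1] + samples[n / 2]) / 2;
}

/* Times one call per sample, pausing before each, and returns the median in ns. */
long long time_single_calls_ns(copy_fn fn, char *dst, const char *src,
                               long long *samples, size_t n, pause_fn pause)
{
    assert(fn != NULL && dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
        struct timespec t0, t1;
        if (pause != NULL) pause();  /* give other work a chance to displace cached data */
        timespec_get(&t0, TIME_UTC);
        fn(dst, src);
        timespec_get(&t1, TIME_UTC);
        samples[i] = (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                   + (long long)(t1.tv_nsec - t0.tv_nsec);
    }
    return median_ll(samples, n);
}
```

The median is used here because single-call timings tend to contain a few large outliers (interrupts, task switches) that would pull an average upwards.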

hutch--

qWord is right here: stabilising the timings with a 100 ms delay gives a more realistic result, even though it does not change much. I had a quick play with the unroll rate and found that 3 was very slightly faster, while 4 and higher made no difference. Changing the proc alignment slowed it down, and aligning the first label also slowed it down.

My own interest in these simple byte copy routines is how fast they are the first time they run, a factor I call "attack", rather than in streamed tests, as it is not uncommon to perform this kind of copy in the middle of a much more complex algorithm where the call overhead is a problem. Now, while a 4-byte copy will usually be faster, it runs into alignment problems with string data, which you cannot guarantee to be 4-byte aligned, whereas the simple byte-level copy is insensitive to alignment.
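
One common way around the alignment problem is to copy byte by byte until the source pointer reaches a 4-byte boundary and only then switch to 4-byte loads. The C sketch below illustrates that idea; it is not taken from any routine in this thread. The names is_aligned4 and copy_align_then_dwords are made up here, and only the source side is aligned, since the stores stay byte sized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True when p sits on a 4-byte boundary. */
int is_aligned4(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Byte copy until src is 4-byte aligned, then aligned 4-byte loads.
   Assumption: src is readable up to the next 4-byte boundary after its
   terminator (true when the buffer itself ends on such a boundary). */
void copy_align_then_dwords(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    while (!is_aligned4(src)) {             /* alignment-insensitive byte phase */
        if ((*dst++ = *src++) == '\0') return;
    }
    for (;;) {
        unsigned char b[4];
        memcpy(b, src, 4);                  /* stands in for an aligned dword load */
        for (int k = 0; k < 4; ++k) {
            dst[k] = (char)b[k];
            if (b[k] == 0) return;          /* terminator copied, done */
        }
        src += 4;
        dst += 4;
    }
}
```

The byte phase adds up to three extra iterations and a branch per call, which works against the first-call speed described above, so this is a trade-off rather than a free improvement.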

Manos

Quote from: qWord on May 09, 2013, 08:34:40 AM
I'm curious whether anyone is interested in a more statistical approach to measurement, where the timing is taken for a single call to the code under test and several milliseconds pass between calls, so that the cache is not involved as it is with the loop-x-thousand-times method?
If my poor English does not mislead me, you mean that I took only one measurement for each case.
I can assure you that I know how to take measurements.
I am a physicist, and the first thing I was taught at university is that we must take many measurements and average them.
If you look at my first post in this thread, you will see that I refer to averages.
If I wrote all my measurements here, they would fill two pages.

I have run a new test with a string of double length.
Here are the results:

szText      db "abcdefghijklmnopqrstuvwxyz123456789abcdefghijklmnopqrstuvwxyz123456789", 0
szBuffer    db 128 dup (0)

lstrcpy (API)     883
StrCpy2 (Hutch)   383
strcpy  (crt)     330


Manos.