Recent Posts

1
The Laboratory / Re: Faster Memcopy ...
« Last post by guga on Today at 03:26:18 PM »
About the Unicode version: JJ, I ported yours to work with Unicode as well. It seems to work OK, and it is faster on my machine. Could you run it through your benchmark function to compare the speeds, please?

JJ Unicode Version
Code:
Proc UniStrLenJJ:
    Arguments @Input
    Uses ecx, edx ; <--- preserved registers
   
    mov eax D@Input
    mov ecx eax ; much faster than [esp+4]
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax ; changed to the unaligned load in both the Unicode and ANSI versions: a bit faster on my machine, and it prevents crashes when calculating unaligned strings
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1

EndP

JJ Ansi version
Code:
Proc StrLenJJ:
    Arguments @Input
    Uses ecx, edx
   
    mov eax D@Input
    mov ecx eax ; much faster than [esp+4]
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqb xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqb xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input

EndP

In my tests, the ANSI and Unicode versions do not differ much in speed.

ANSI version, using the following string (99 chars; I'm testing odd and even lengths, from long strings down to a single byte):
[TestAnsi: B$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timing: 2.95913440552019 ns

Unicode version
[TestUnicode: U$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timing: 3.46197153137555 ns



Btw, the same question I asked Nidud: how can the Unicode version check whether the string is really Unicode, and return 0-1 (i.e. -1) if it finds non-Unicode chars while it is calculating the length?
2
The Laboratory / Re: Faster Memcopy ...
« Last post by guga on Today at 03:05:45 PM »
Hi Nidud. The Unicode version is working as expected :)

One question: how can I add a safety check inside the function to see whether the string is really Unicode while it is calculating the length?

I mean, say I have a bad Unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How can the function detect the bad chars 'hi' and return a value of, say, 0-1 (meaning the function found an error)?
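One possible reading of this question, as a minimal C sketch rather than a patch to the SSE routine: treat any 16-bit unit whose high byte is non-zero as "bad", which is what the 'hi' pair in the example produces. That definition is an assumption (it would also reject legitimate non-Latin text), and the name `checked_wlen` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in 16-bit units of a zero-terminated
   little-endian string, or -1 if a unit has a non-zero high byte.
   Assumption: "really Unicode" means ASCII/Latin-1 text stored as
   16-bit units; genuine non-Latin UTF-16 text would be rejected. */
long checked_wlen(const unsigned char *s)
{
    assert(s != NULL);
    long n = 0;
    for (;;) {
        unsigned lo = s[2 * n];       /* low byte of unit n  */
        unsigned hi = s[2 * n + 1];   /* high byte of unit n */
        if (hi != 0)
            return -1;                /* e.g. the 'h','i' pair: hi = 'i' */
        if (lo == 0)
            return n;                 /* 00 00 terminator reached */
        n++;
    }
}
```

For the bytes 'H', 0, 'e', 0, 'hi', 0 this returns -1 at the third unit; for 'H', 0, 'e', 0, 0, 0 it returns 2. The check stops at the first bad unit, so it never reads past the malformed pair.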

The RosAsm port of your function looks like this:
Code:

Proc UnicodeStrlen:
    Arguments @Input
    Uses ecx, edx, edi ; preserve ecx, edx, edi: I benchmark all algos with the registers preserved, so they are compared under the very same conditions

    mov eax D@Input
    bt eax 0 | jc L3>
    mov ecx D@Input
    and eax 0-16
    and ecx 16-1
    or  edx 0-1
    shl edx cl
    pxor xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    pxor xmm0 xmm0
    and  ecx edx
    jnz  L2>

L1:
    movups  xmm1 X$eax
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz  L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1
    ExitP
L3:
    mov  edx edi
    mov  edi eax
    xor  eax eax
    or   ecx 0-1
    repne scasw
    not  ecx
    dec  ecx
    mov  eax ecx
    mov  edi edx

EndP
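The counter arithmetic in the L3 fallback (or ecx 0-1 / repne scasw / not ecx / dec ecx) can be followed in a small C model. This is only a sketch of the arithmetic: the name `scasw_len` is hypothetical, and the model takes an ordinary aligned pointer, whereas the assembly path exists for odd addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of: or ecx,-1 / repne scasw / not ecx / dec ecx.
   repne scasw decrements ecx once per word examined,
   including the terminating zero word. */
uint32_t scasw_len(const uint16_t *s)
{
    assert(s != NULL);
    uint32_t ecx = 0xFFFFFFFFu;   /* or ecx,-1 */
    for (;;) {
        ecx--;                    /* one decrement per scasw step */
        if (*s++ == 0)
            break;                /* repne stops when the word matches ax = 0 */
    }
    ecx = ~ecx;                   /* not ecx: words examined = n + 1 */
    return ecx - 1;               /* dec ecx: n characters */
}
```

For a string of n characters the loop runs n + 1 times, so ecx ends at -1 - (n + 1); inverting gives n + 1 and the final decrement gives n.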

3
The Laboratory / Re: Faster Memcopy ...
« Last post by guga on Today at 02:22:37 PM »
Thanks, guys, I'll benchmark the speed.

Nidud, about the memory issue: indeed, you are right. I was confused, thinking the algo would write something to a protected area, but it does not. It only computes the offset at which to start calculating the length, is that it? Thanks for the Unicode version. I'll test it right now.

JJ, I tried to assemble your file, but MasmBasic throws a configuration error. I think I do have UAsm in the same path as ml.exe, but this is what is shown to me:



How can I configure MasmBasic properly to make it work for this test?
4
The Laboratory / Re: Faster Memcopy ...
« Last post by nidud on Today at 12:31:13 PM »
Well, if you feed it with this string:

    align 16
    db 15 dup(0)
err db "error",0

the aligned (first) compare will be:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e | r r o r 0

xmm0 will be:
00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

and there will be no jump to L2
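The case above can be checked with a small C model of the first, aligned block. It is a sketch only: the function names are hypothetical, and the loops stand in for pcmpeqb, pmovmskb and the shifted mask in edx.

```c
#include <assert.h>
#include <stdint.h>

/* Model of pcmpeqb against zero followed by pmovmskb:
   bit i of the result is set when block[i] == 0. */
uint32_t zero_byte_mask(const unsigned char block[16])
{
    uint32_t mask = 0;
    for (int i = 0; i < 16; i++)
        if (block[i] == 0)
            mask |= (uint32_t)1 << i;
    return mask;
}

/* Model of "or edx,-1 / shl edx,cl / and ecx,edx": drop the mask bits
   that belong to bytes before the string start (offset 0..15). */
uint32_t first_block_hits(const unsigned char block[16], unsigned offset)
{
    assert(offset < 16);
    return zero_byte_mask(block) & (0xFFFFFFFFu << offset);
}
```

For a block of 15 zero bytes followed by 'e', with the string starting at offset 15, `zero_byte_mask` gives 0x7FFF and `first_block_hits` gives 0. Execution therefore falls through to L1 while xmm0 still holds the compare result instead of zero, which is why the second xorps is needed.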
5
The Laboratory / Re: Faster Memcopy ...
« Last post by jj2007 on Today at 11:32:06 AM »
Quote
I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?

Yes. The alignment may go back up to 15 bytes, and they may all be zero.

I've tested it with a 100 byte string, increasing the pointer 100 times. "All zero" shouldn't be a problem. But I admit I am too tired right now to understand it ;-)

Here's my current version of your fast algo, 62 bytes short and considerably faster than the original:
Code:
Algo1 proc
  mov eax, [esp+4]
  mov ecx, eax ; much faster than [esp+4]
  and eax, -16
  and ecx, 16-1
  or edx, -1
  shl edx, cl
  xorps xmm0, xmm0 ; needed for short strings
  pcmpeqb xmm0, [eax]
  pmovmskb ecx, xmm0
;   xorps xmm0, xmm0 ; ??
  and ecx, edx ; short string?
  jnz L2
L1:
  add eax, 16
  movaps xmm1, [eax]
  pcmpeqb xmm1, xmm0
  pmovmskb ecx, xmm1
  test ecx, ecx
  jz L1
L2:
  bsf ecx, ecx
  add eax, ecx
  sub eax, [esp+4]
  retn 4
Algo1 endp
6
The Laboratory / Re: Faster Memcopy ...
« Last post by nidud on Today at 11:12:13 AM »
On my CPU the algo is over 2% faster with mov ecx, eax

The first move is slower but the next should (in theory) be the same. It's difficult to test this in a loop given it ends up in the cache after the first move. A simple test should then be more or less equal.

    mov eax,[esp+4]
    mov ecx,[esp+4]
    mov edx,[esp+4]
...
    mov eax,[esp+4]
    mov ecx,eax
    mov edx,eax
...
Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2)
----------------------------------------------
-- test(0)
    35942 cycles, rep(10000), code( 13) 0.asm: mov eax,[esp+4]
    35951 cycles, rep(10000), code(  9) 1.asm: mov eax,ecx
-- test(1)
    35058 cycles, rep(10000), code( 13) 0.asm: mov eax,[esp+4]
    36532 cycles, rep(10000), code(  9) 1.asm: mov eax,ecx
-- test(2)
    34778 cycles, rep(10000), code( 13) 0.asm: mov eax,[esp+4]
    35262 cycles, rep(10000), code(  9) 1.asm: mov eax,ecx

total [0 .. 2], 1++
   105778 cycles 0.asm: mov eax,[esp+4]
   107745 cycles 1.asm: mov eax,ecx

Quote
I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?

Yes. The alignment may go back up to 15 bytes, and they may all be zero.
7
The Laboratory / Re: Faster Memcopy ...
« Last post by jj2007 on Today at 10:44:32 AM »
Code:
Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax ; much faster than [esp+4]

Actually they are the same (speed-wise) but selected by size for alignment of L1.

On my CPU the algo is over 2% faster with mov ecx, eax

Quote

  ; xorps xmm0,xmm0   ; ??

This must be zero for compare below:

L1:
    movaps  xmm1,[eax]
    pcmpeqb xmm1,xmm0


I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?
8
The Laboratory / Re: Faster Memcopy ...
« Last post by nidud on Today at 09:48:25 AM »
This is the Unicode version. Note that the string must be 2-byte aligned (which is mostly the case with Unicode strings) for the SSE path to work; odd addresses take the scasw fallback at L3.

    mov     eax,[esp+4]
    bt      eax,0
    jc      L3
    mov     ecx,[esp+4]
    and     eax,-16
    and     ecx,16-1
    or      edx,-1
    shl     edx,cl
    pxor    xmm0,xmm0
    pcmpeqw xmm0,[eax]
    add     eax,16
    pmovmskb ecx,xmm0
    pxor    xmm0,xmm0
    and     ecx,edx
    jnz     L2
L1:
    movaps  xmm1,[eax]
    pcmpeqw xmm1,xmm0
    pmovmskb ecx,xmm1
    add     eax,16
    test    ecx,ecx
    jz      L1
L2:
    bsf     ecx,ecx
    lea     eax,[eax+ecx-16]
    sub     eax,[esp+4]
    shr     eax,1
    ret
L3:
    mov     edx,edi
    mov     edi,eax
    xor     eax,eax
    or      ecx,-1
    repne   scasw
    not     ecx
    dec     ecx
    mov     eax,ecx
    mov     edi,edx
    ret

Result:

total [0 .. 40], 8++
   575817 cycles 2.asm: SSE 16
  3081171 cycles 0.asm: msvcrt.wcslen()
  4261124 cycles 1.asm: scasw
total [41 .. 80], 7++
   629595 cycles 2.asm: SSE 16
  4696938 cycles 1.asm: scasw
  4742392 cycles 0.asm: msvcrt.wcslen()
total [600 .. 1000], 100++
   987251 cycles 2.asm: SSE 16
  7455315 cycles 1.asm: scasw
  8530590 cycles 0.asm: msvcrt.wcslen()
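Why the odd-address test (bt eax,0 / jc L3) matters can be seen in a small C model of pcmpeqw on one 16-byte block. This is a sketch under assumptions: the function names are hypothetical, the filler byte 0xCC is arbitrary, and only the first block is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Model of pcmpeqw against zero followed by pmovmskb: each 16-bit lane
   that is entirely zero sets its two mask bits. Lanes always start at
   even offsets of the 16-byte block. */
uint32_t zero_word_mask(const unsigned char block[16])
{
    uint32_t mask = 0;
    for (int i = 0; i < 16; i += 2)
        if (block[i] == 0 && block[i + 1] == 0)
            mask |= (uint32_t)3 << i;
    return mask;
}

/* Length the SSE path would report from the first block alone,
   or -1 if it would continue at L1. */
long first_block_wlen(const unsigned char block[16], unsigned offset)
{
    assert(offset < 16);
    uint32_t hits = zero_word_mask(block) & (0xFFFFFFFFu << offset);
    if (hits == 0)
        return -1;
    unsigned pos = 0;
    while (!(hits & 1u)) {          /* bsf ecx,ecx */
        hits >>= 1;
        pos++;
    }
    return (long)(pos - offset) >> 1;   /* sub eax,[esp+4] / shr eax,1 */
}
```

With "AB" stored at even offset 2 the model returns 2. With the same string at odd offset 1, the high byte of 'B' and the low byte of the terminator share a lane, so the zero word is found one byte early and the model returns 1.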
9
The Laboratory / Re: Faster Memcopy ...
« Last post by nidud on Today at 09:34:12 AM »
Code:
Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax ; much faster than [esp+4]

Actually they are the same (speed-wise) but selected by size for alignment of L1.

  ; xorps xmm0,xmm0   ; ??

This must be zero for compare below:

L1:
    movaps  xmm1,[eax]
    pcmpeqb xmm1,xmm0
10
The Workshop / Re: memmove() challenge
« Last post by hutch-- on Today at 09:10:11 AM »
 :biggrin:

What a tangled mess, which is also my observation of the original Microsoft source code.