Faster Memcopy ...

guga · November 14, 2019, 08:55:06 PM

Old, old, old post, but i was interesting on the timmigns. I made a small variation of Nidud´s StrLen. Seems fast and stable, but didn´t measured the results on masm. Can someone give a try ?

Code Select

Proc FastStrlenTest:
    Arguments @Input
    Uses ecx, edx ; RosAsm macro to preserve registers. Just a push ecx+push edx after the start of the function (push ebp | mov ebp esp). Before the function ends at "endP' macro, the registers are restored with pop edx | pop ecx

    mov eax D@Input
    movdqu xmm0 X$eax
    xorps xmm1 xmm1
    pcmpeqb xmm0 xmm1
    pmovmskb ecx xmm0
    and ecx 0-1
    jne @done
        pmovmskb ecx xmm1
        shl ecx 16
        and ecx 0-1
    jne @done
    xorps xmm2 xmm2
@loop_32:

    lea eax D$eax+32
    movdqu xmm0 X$eax
    movdqu  xmm1 X$eax+16
    pcmpeqb xmm0 xmm2 | pmovmskb edx xmm0
    pcmpeqb xmm1 xmm2 | pmovmskb ecx xmm1
    add edx ecx
 jbe @loop_32
    shl ecx 16
    pmovmskb edx xmm0
    add ecx edx
@done:

 bsf ecx ecx
 lea eax D$ecx+eax
 sub eax D@Input

EndP

Note: I modified the beggining of Nidud´s algo, because the adress at the start, may result on a crash if the string is located exactly on the beginning of the memory addressing . That´s why i removed the initiual

Code Select

    and ecx 32-1
    and eax 0-32

Btw...the fucntion seems to work on any size of the null terminated string...from 1 to XXXXX

Need to fully understand this, in order to try making a Unicode version too :)

nidud · November 14, 2019, 11:14:04 PM

deleted

guga · November 15, 2019, 12:21:09 AM

I saw that while i was debugging the function. It didn´t crashed, but the pointer entered on a previously allocated memory (28 bytes before the start of the input). It didn´t crashed, probably because the memory was correctly allocated, so it didn´t reached it´s beginning.

Ex: Say, the memory starts at 100000 And the function uses a pointer at the very 1st byte 100001. What could happen is that since the and operation will force the Inputted pointer to start at 100000-32 bytes (or something). And, if the area before that address is not allocated, most likely it will crash. However, it didn´t crashed on my tests, but i made that small adaptation just to make sure it won´t crash ever.

Btw,...it´s an amazing algorithm. I wonder if it can be optimized further. I made a small test comparing pcmpeqb xmm0 xmm1 rather then pcmpeqb xmm0 xmm2/pcmpeqb xmm1 xmm2 but couldn´t understand what this opcode is doing exactly yet.

The timmings are simply great and it do finds the len for small strings etc...while AxStrLenSSE2 version seems to have some errors on the results (at least, if i ported it properly)

nidud · November 15, 2019, 02:22:04 AM

deleted

nidud · November 15, 2019, 05:49:03 AM

deleted

jj2007 · November 15, 2019, 08:52:04 AM

Here is a test run with your (marginally modified) 1.asm ... 4.asm and other StrLen algos.
The 3.asm algo returns a wrong value, I have not found out why.

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6931    cycles for 100 * CRT strlen
5372    cycles for 100 * Masm32 StrLen
19701   cycles for 100 * Windows lstrlen
2006    cycles for 100 * MasmBasic Len
1579    cycles for 100 * Algo1
1712    cycles for 100 * Algo2
1103    cycles for 100 * Algo3
3209    cycles for 100 * Algo4

6941    cycles for 100 * CRT strlen
5393    cycles for 100 * Masm32 StrLen
19722   cycles for 100 * Windows lstrlen
1994    cycles for 100 * MasmBasic Len
1587    cycles for 100 * Algo1
1722    cycles for 100 * Algo2
1102    cycles for 100 * Algo3
3227    cycles for 100 * Algo4

6938    cycles for 100 * CRT strlen
5380    cycles for 100 * Masm32 StrLen
19692   cycles for 100 * Windows lstrlen
2011    cycles for 100 * MasmBasic Len
1580    cycles for 100 * Algo1
1713    cycles for 100 * Algo2
1101    cycles for 100 * Algo3
3205    cycles for 100 * Algo4

6970    cycles for 100 * CRT strlen
5404    cycles for 100 * Masm32 StrLen
19710   cycles for 100 * Windows lstrlen
2010    cycles for 100 * MasmBasic Len
1589    cycles for 100 * Algo1
1710    cycles for 100 * Algo2
1104    cycles for 100 * Algo3
3204    cycles for 100 * Algo4

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
82      bytes for Algo1
114     bytes for Algo2
690     bytes for Algo3
761     bytes for Algo4

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1
100     = eax Algo2
14      = eax Algo3
100     = eax Algo4

This is the clear winner:

Code Select

; include 1j.asm
Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax	; much faster than [esp+4]
  and eax,-16
  and ecx,16-1
  or edx,-1
  shl edx,cl
  xorps xmm0,xmm0
  pcmpeqb xmm0,[eax]
  add eax,16
  pmovmskb ecx,xmm0
  ; xorps xmm0,xmm0	; ??
  and ecx,edx
  jnz L2
L1:
  movaps xmm1,[eax]
  pcmpeqb xmm1,xmm0
  pmovmskb ecx,xmm1
  add eax,16
  test ecx,ecx
  jz L1
L2:
  bsf ecx,ecx
  lea eax,[eax+ecx-16]
  sub eax,[esp+4]
  retn 4
Algo1 endp

nidud · November 15, 2019, 09:34:12 AM

deleted

nidud · November 15, 2019, 09:48:25 AM

deleted

jj2007 · November 15, 2019, 10:44:32 AM

Quote from: nidud on November 15, 2019, 09:34:12 AM
Quote from: jj2007 on November 15, 2019, 08:52:04 AM
Code Select Expand
Algo1 proc mov eax,[esp+4] mov ecx,eax ; much faster than [esp+4]

Actually they are the same (speed-wise) but selected by size for alignment of L1.

On my CPU the algo is over 2% faster with mov ecx, eax

Quote

; xorps xmm0,xmm0 ; ??

This must be zero for compare below:

L1:
movaps xmm1,[eax]
pcmpeqb xmm1,xmm0

I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?

nidud · November 15, 2019, 11:12:13 AM

deleted

jj2007 · November 15, 2019, 11:32:06 AM

Quote from: nidud on November 15, 2019, 11:12:13 AM
QuoteI saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?

Yes. The alignment may go back 31 15 byte and they may all be zero.

I've tested it with a 100 byte string, increasing the pointer 100 times. "All zero" shouldn't be a problem. But I admit I am too tired right now to understand it ;-)

Here's my current version of your fast algo, 62 bytes short and considerably faster than the original:

Code Select

Algo1 proc
  mov eax, [esp+4]
  mov ecx, eax	; much faster than [esp+4]
  and eax, -16
  and ecx, 16-1
  or edx, -1
  shl edx, cl
  xorps xmm0, xmm0	; needed for short strings
  pcmpeqb xmm0, [eax]
  pmovmskb ecx, xmm0
;   xorps xmm0, xmm0	; ??
  and ecx, edx	; short string?
  jnz L2
L1:
  add eax, 16
  movaps xmm1, [eax]
  pcmpeqb xmm1, xmm0
  pmovmskb ecx, xmm1
  test ecx, ecx
  jz L1
L2:
  bsf ecx, ecx
  add eax, ecx
  sub eax, [esp+4]
  retn 4
Algo1 endp

nidud · November 15, 2019, 12:31:13 PM

deleted

guga · November 15, 2019, 02:22:37 PM

Tks, guys, i´ll give a test for benchmarking the speed..

Nidud, about the memory issue...Indeed you are right. I was confused thinking the algo would write something to a protected area, but, it is not. It´s only computing the difference where to start calculating the lenght is that it ? Tks for the unicode version. I´ll test ir right now

JJ...i tried to assemble your file, but masmbasic throw an error of configuration. I think i do have uasm on the same path as in ml.exe, but this is what is being showed to me:

How can i configure masmbasic properly to make it work on this test ?

guga · November 15, 2019, 03:05:45 PM

Hi Nidud....The unicode version is working as expected :)

One question. How to implement a security inside the function to see either the string is really unicode or not, while it is calculating the lenght ?

I mean....say i have a bad unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How to make the function checks the bad chars 'hi', and return a value of ... say 0-1 (meaning the function found an error ) ?

The RosAsm porting of your function, is like this:

Code Select



Proc UnicodeStrlen:
    Arguments @Input
    Uses ecx, edx, edi ; <----preserve ecx, edx, edi registers - I benchmark the speed with all algos preserving the registers to be sure they are all behaving on the same way to see which one is faster etc under the very same conditions.

    mov eax D@Input
    bt eax 0 | jc L3>
    mov ecx D@Input
    and eax 0-16
    and ecx 16-1
    or  edx 0-1
    shl edx cl
    pxor xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    pxor xmm0 xmm0
    and  ecx edx
    jnz  L2>

L1:
    movups  xmm1 X$eax
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz  L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1
    ExitP
L3:
    mov  edx edi
    mov  edi eax
    xor  eax eax
    or   ecx 0-1
    repne scasw
    not  ecx
    dec  ecx
    mov  eax ecx
    mov  edi edx

EndP

[code]

guga · November 15, 2019, 03:26:18 PM

ABout the Unicode version. JJ, i ported yours to work with Unicode as well, and it seem to work ok, and it is faster on my machine. Can you give a test on your benchmark function to compare the speeds, pls ?

JJ Unicode Version

Code Select


Proc UniStrLenJJ:
    Arguments @Input
    Uses ecx, edx ; <--- preserved register
    
    mov eax D@Input
    mov ecx eax ; much faster than [esp+4]
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax ; <---- Changed to unaligned version on both unicode and ansi version. A bit faster on my machine, and prevent crashing on unaligned strings calculation)
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1

EndP

JJ Ansi version

Code Select


Proc StrLenJJ:
    Arguments @Input
    Uses ecx, edx
    
    mov eax D@Input
    mov ecx eax ; much faster than [esp+4]
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqb xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqb xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx,ecx
    lea eax D$eax+ecx-16
    sub eax D@Input

EndP

On my tests, in terms of speed the Ansi version and Unicode version does not varies that much in speed

Ansi version usign the following string (99 chars...I´m testing odd and even strings, big or tiny as 1 byte only):
[TestAnsi: B$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timming (in nanosecs): 2.95913440552019 ns

Unicode version
[TestUnicode: U$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timming (in nanosecs): 3.46197153137555 ns

Btw... Same question i made for Nidud. How to make the Unicode version checks if the string being calculated is really Unicode or Not and make it returns 0-1, if it finds non-unicode chars while it is calculating the lenght ?

The MASM Forum

News:

Faster Memcopy ...

guga

nidud

guga

nidud

nidud

jj2007

nidud

nidud

jj2007

nidud

jj2007

nidud

guga

guga

guga