Author Topic: Faster Memcopy ...  (Read 32887 times)

guga

  • Member
  • *****
  • Posts: 1074
  • Assembly is a state of art.
    • RosAsm
Re: Faster Memcopy ...
« Reply #90 on: November 14, 2019, 08:55:06 PM »
Old, old, old post, but i was interesting on the timmigns. I made a small variation of Nidud´s StrLen. Seems fast and stable, but didn´t measured the results on masm. Can someone give a try ?

Code: [Select]
Proc FastStrlenTest:
    Arguments @Input
    Uses ecx, edx ; RosAsm macro to preserve registers. Just a push ecx+push edx after the start of the function (push ebp | mov ebp esp). Before the function ends at "endP' macro, the registers are restored with pop edx | pop ecx

    mov eax D@Input
    movdqu xmm0 X$eax
    xorps xmm1 xmm1
    pcmpeqb xmm0 xmm1
    pmovmskb ecx xmm0
    and ecx 0-1
    jne @done
        pmovmskb ecx xmm1
        shl ecx 16
        and ecx 0-1
    jne @done
    xorps xmm2 xmm2
@loop_32:

    lea eax D$eax+32
    movdqu xmm0 X$eax
    movdqu  xmm1 X$eax+16
    pcmpeqb xmm0 xmm2 | pmovmskb edx xmm0
    pcmpeqb xmm1 xmm2 | pmovmskb ecx xmm1
    add edx ecx
 jbe @loop_32
    shl ecx 16
    pmovmskb edx xmm0
    add ecx edx
@done:

 bsf ecx ecx
 lea eax D$ecx+eax
 sub eax D@Input

EndP

Note: I modified the beggining of Nidud´s algo, because the adress at the start, may result on a crash if the string is located exactly on the beginning of the memory addressing . That´s why i removed the initiual
Code: [Select]
    and ecx 32-1
    and eax 0-32

Btw...the fucntion seems to work on any size of the null terminated string...from 1 to XXXXX


Need to fully understand this, in order to try making a Unicode version too :)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #91 on: November 14, 2019, 11:14:04 PM »
Note: I modified the beggining of Nidud´s algo, because the adress at the start, may result on a crash if the string is located exactly on the beginning of the memory addressing . That´s why i removed the initiual

Not sure this is possible and the benchmark also test for overlapping memory moves.

So did this actually happened or is it just a theory?

guga

  • Member
  • *****
  • Posts: 1074
  • Assembly is a state of art.
    • RosAsm
Re: Faster Memcopy ...
« Reply #92 on: November 15, 2019, 12:21:09 AM »
I saw that while i was debugging the function. It didn´t crashed, but the pointer entered on a previously allocated memory (28 bytes before the start of the input). It didn´t crashed, probably because the memory was correctly allocated, so it didn´t reached it´s beginning.

Ex: Say, the memory starts at 100000 And the function uses a pointer at the very 1st byte 100001. What could happen is that since the and operation will force the Inputted pointer to start at 100000-32 bytes (or something). And, if the area before that address is not allocated, most likely it will crash. However, it didn´t crashed on my tests, but i made that small adaptation just to make sure it won´t crash ever.

Btw,...it´s an amazing algorithm. I wonder if it can be optimized further. I made a small test comparing pcmpeqb xmm0 xmm1 rather then pcmpeqb xmm0 xmm2/pcmpeqb xmm1 xmm2 but couldn´t understand what this opcode is doing exactly yet.

The timmings are simply great and it do finds the len for small strings etc...while AxStrLenSSE2 version seems to have some errors on the results (at least, if i ported it properly)

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #93 on: November 15, 2019, 02:22:04 AM »
I saw that while i was debugging the function. It didn´t crashed, but the pointer entered on a previously allocated memory (28 bytes before the start of the input). It didn´t crashed, probably because the memory was correctly allocated, so it didn´t reached it´s beginning.

Ex: Say, the memory starts at 100000 And the function uses a pointer at the very 1st byte 100001. What could happen is that since the and operation will force the Inputted pointer to start at 100000-32 bytes (or something). And, if the area before that address is not allocated, most likely it will crash. However, it didn´t crashed on my tests, but i made that small adaptation just to make sure it won´t crash ever.

Memory is page-aligned so it never starts at an uneven address. The problem is more about the end as addressed in this tread. The situation may be explained like this:

0              1             2
[PAGE_NOACCESS][current page][PAGE_NOACCESS]

Assuming the address is 0x1000 (start of current page) the alignment code will have no effect ((0x1000 & -32) = 0x1000) so page 0 will never be accessed. However, if the string is two byte at address 0x1FFD and the alignment code is removed you will start by reading 30 bytes of page 2. To avoid this the address is set to 0x2000-32 and a bit-mask is added for valid bytes (two upper bits in this case) in the first read.

Quote
I made a small test comparing pcmpeqb xmm0 xmm1 rather then pcmpeqb xmm0 xmm2/pcmpeqb xmm1 xmm2 but couldn´t understand what this opcode is doing exactly yet.

It compares 16 bytes and sets each equals to FF in the first operand. pmovmskb convert the result to 16 bits.

Quote
The timmings are simply great and it do finds the len for small strings etc...while AxStrLenSSE2 version seems to have some errors on the results (at least, if i ported it properly)

It's an old test-bed and it wont assemble with the current version of Asmc so it needs an update.
« Last Edit: November 15, 2019, 05:50:09 AM by nidud »

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #94 on: November 15, 2019, 05:49:03 AM »
I updated the 32-bit benchmark test.

A 16 byte version seems better, at least on the machine I use now.

    mov     eax,[esp+4]
    mov     ecx,[esp+4]
    and     eax,-16
    and     ecx,16-1
    or      edx,-1
    shl     edx,cl
    xorps   xmm0,xmm0
    pcmpeqb xmm0,[eax]
    add     eax,16
    pmovmskb ecx,xmm0
    xorps   xmm0,xmm0
    and     ecx,edx
    jnz     L2
L1:
    movaps  xmm1,[eax]
    pcmpeqb xmm1,xmm0
    pmovmskb ecx,xmm1
    add     eax,16
    test    ecx,ecx
    jz      L1
L2:
    bsf     ecx,ecx
    lea     eax,[eax+ecx-16]
    sub     eax,[esp+4]
    ret

bench:

total [0 .. 40], 8++
   339601 cycles 3.asm: SSE Intel Silvermont
   370809 cycles 1.asm: SSE 16
   465072 cycles 2.asm: SSE 32
   756379 cycles 4.asm: SSE Intel Atom
   853326 cycles 0.asm: msvcrt.strlen()
total [41 .. 80], 7++
   301950 cycles 3.asm: SSE Intel Silvermont
   327903 cycles 1.asm: SSE 16
   423696 cycles 2.asm: SSE 32
   649882 cycles 4.asm: SSE Intel Atom
  1262412 cycles 0.asm: msvcrt.strlen()
total [600 .. 1000], 100++
   239360 cycles 3.asm: SSE Intel Silvermont
   308941 cycles 4.asm: SSE Intel Atom
   481037 cycles 2.asm: SSE 32
   485014 cycles 1.asm: SSE 16
  2005072 cycles 0.asm: msvcrt.strlen()

jj2007

  • Member
  • *****
  • Posts: 9794
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #95 on: November 15, 2019, 08:52:04 AM »
Here is a test run with your (marginally modified) 1.asm ... 4.asm and other StrLen algos.
The 3.asm algo returns a wrong value, I have not found out why.

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6931    cycles for 100 * CRT strlen
5372    cycles for 100 * Masm32 StrLen
19701   cycles for 100 * Windows lstrlen
2006    cycles for 100 * MasmBasic Len
1579    cycles for 100 * Algo1
1712    cycles for 100 * Algo2
1103    cycles for 100 * Algo3
3209    cycles for 100 * Algo4

6941    cycles for 100 * CRT strlen
5393    cycles for 100 * Masm32 StrLen
19722   cycles for 100 * Windows lstrlen
1994    cycles for 100 * MasmBasic Len
1587    cycles for 100 * Algo1
1722    cycles for 100 * Algo2
1102    cycles for 100 * Algo3
3227    cycles for 100 * Algo4

6938    cycles for 100 * CRT strlen
5380    cycles for 100 * Masm32 StrLen
19692   cycles for 100 * Windows lstrlen
2011    cycles for 100 * MasmBasic Len
1580    cycles for 100 * Algo1
1713    cycles for 100 * Algo2
1101    cycles for 100 * Algo3
3205    cycles for 100 * Algo4

6970    cycles for 100 * CRT strlen
5404    cycles for 100 * Masm32 StrLen
19710   cycles for 100 * Windows lstrlen
2010    cycles for 100 * MasmBasic Len
1589    cycles for 100 * Algo1
1710    cycles for 100 * Algo2
1104    cycles for 100 * Algo3
3204    cycles for 100 * Algo4

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
82      bytes for Algo1
114     bytes for Algo2
690     bytes for Algo3
761     bytes for Algo4

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1
100     = eax Algo2
14      = eax Algo3
100     = eax Algo4

This is the clear winner:
Code: [Select]
; include 1j.asm
Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax ; much faster than [esp+4]
  and eax,-16
  and ecx,16-1
  or edx,-1
  shl edx,cl
  xorps xmm0,xmm0
  pcmpeqb xmm0,[eax]
  add eax,16
  pmovmskb ecx,xmm0
  ; xorps xmm0,xmm0 ; ??
  and ecx,edx
  jnz L2
L1:
  movaps xmm1,[eax]
  pcmpeqb xmm1,xmm0
  pmovmskb ecx,xmm1
  add eax,16
  test ecx,ecx
  jz L1
L2:
  bsf ecx,ecx
  lea eax,[eax+ecx-16]
  sub eax,[esp+4]
  retn 4
Algo1 endp

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #96 on: November 15, 2019, 09:34:12 AM »
Code: [Select]
Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax ; much faster than [esp+4]

Actually they are the same (speed-wise) but selected by size for alignment of L1.

  ; xorps xmm0,xmm0   ; ??

This must be zero for compare below:

L1:
    movaps  xmm1,[eax]
    pcmpeqb xmm1,xmm0

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #97 on: November 15, 2019, 09:48:25 AM »
This is the Unicode version. Note that the string must be aligned 2 (which is mostly the case with Unicode strings) for this to work.

    mov     eax,[esp+4]
    bt      eax,0
    jc      L3
    mov     ecx,[esp+4]
    and     eax,-16
    and     ecx,16-1
    or      edx,-1
    shl     edx,cl
    pxor    xmm0,xmm0
    pcmpeqw xmm0,[eax]
    add     eax,16
    pmovmskb ecx,xmm0
    pxor    xmm0,xmm0
    and     ecx,edx
    jnz     L2
L1:
    movaps  xmm1,[eax]
    pcmpeqw xmm1,xmm0
    pmovmskb ecx,xmm1
    add     eax,16
    test    ecx,ecx
    jz      L1
L2:
    bsf     ecx,ecx
    lea     eax,[eax+ecx-16]
    sub     eax,[esp+4]
    shr     eax,1
    ret
L3:
    mov     edx,edi
    mov     edi,eax
    xor     eax,eax
    or      ecx,-1
    repne   scasw
    not     ecx
    dec     ecx
    mov     eax,ecx
    mov     edi,edx
    ret

Result:

total [0 .. 40], 8++
   575817 cycles 2.asm: SSE 16
  3081171 cycles 0.asm: msvcrt.wcslen()
  4261124 cycles 1.asm: scasw
total [41 .. 80], 7++
   629595 cycles 2.asm: SSE 16
  4696938 cycles 1.asm: scasw
  4742392 cycles 0.asm: msvcrt.wcslen()
total [600 .. 1000], 100++
   987251 cycles 2.asm: SSE 16
  7455315 cycles 1.asm: scasw
  8530590 cycles 0.asm: msvcrt.wcslen()

jj2007

  • Member
  • *****
  • Posts: 9794
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #98 on: November 15, 2019, 10:44:32 AM »
Code: [Select]
Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax ; much faster than [esp+4]

Actually they are the same (speed-wise) but selected by size for alignment of L1.

On my CPU the algo is over 2% faster with mov ecx, eax

Quote

  ; xorps xmm0,xmm0   ; ??

This must be zero for compare below:

L1:
    movaps  xmm1,[eax]
    pcmpeqb xmm1,xmm0


I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #99 on: November 15, 2019, 11:12:13 AM »
On my CPU the algo is over 2% faster with mov ecx, eax

The first move is slower but the next should (in theory) be the same. It's difficult to test this in a loop given it ends up in the cache after the first move. A simple test should then be more or less equal.

    mov eax,[esp+4]
    mov ecx,[esp+4]
    mov edx,[esp+4]
...
    mov eax,[esp+4]
    mov ecx,eax
    mov edx,eax
...
Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2)
----------------------------------------------
-- test(0)
    35942 cycles, rep(10000), code( 13) 0.asm: mov eax,[esp+4]
    35951 cycles, rep(10000), code(  9) 1.asm: mov eax,ecx
-- test(1)
    35058 cycles, rep(10000), code( 13) 0.asm: mov eax,[esp+4]
    36532 cycles, rep(10000), code(  9) 1.asm: mov eax,ecx
-- test(2)
    34778 cycles, rep(10000), code( 13) 0.asm: mov eax,[esp+4]
    35262 cycles, rep(10000), code(  9) 1.asm: mov eax,ecx

total [0 .. 2], 1++
   105778 cycles 0.asm: mov eax,[esp+4]
   107745 cycles 1.asm: mov eax,ecx

Quote
I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?

Yes. The alignment may go back 31 15 byte and they may all be zero.

jj2007

  • Member
  • *****
  • Posts: 9794
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #100 on: November 15, 2019, 11:32:06 AM »
Quote
I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?

Yes. The alignment may go back 31 15 byte and they may all be zero.

I've tested it with a 100 byte string, increasing the pointer 100 times. "All zero" shouldn't be a problem. But I admit I am too tired right now to understand it ;-)

Here's my current version of your fast algo, 62 bytes short and considerably faster than the original:
Code: [Select]
Algo1 proc
  mov eax, [esp+4]
  mov ecx, eax ; much faster than [esp+4]
  and eax, -16
  and ecx, 16-1
  or edx, -1
  shl edx, cl
  xorps xmm0, xmm0 ; needed for short strings
  pcmpeqb xmm0, [eax]
  pmovmskb ecx, xmm0
;   xorps xmm0, xmm0 ; ??
  and ecx, edx ; short string?
  jnz L2
L1:
  add eax, 16
  movaps xmm1, [eax]
  pcmpeqb xmm1, xmm0
  pmovmskb ecx, xmm1
  test ecx, ecx
  jz L1
L2:
  bsf ecx, ecx
  add eax, ecx
  sub eax, [esp+4]
  retn 4
Algo1 endp

nidud

  • Member
  • *****
  • Posts: 1799
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #101 on: November 15, 2019, 12:31:13 PM »
Well, if you feed it with this string:

    align 16
    db 15 dup(0)
err db "error",0

the aligned (first) compare will be:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e | r r o r 0

xmm0 will be:
00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

and there will be no jump to L2

guga

  • Member
  • *****
  • Posts: 1074
  • Assembly is a state of art.
    • RosAsm
Re: Faster Memcopy ...
« Reply #102 on: November 15, 2019, 02:22:37 PM »
Tks, guys, i´ll give a test for benchmarking the speed..

Nidud, about the memory issue...Indeed you are right. I was confused thinking the algo would write something to a protected area, but, it is not. It´s only computing the difference where to start calculating the lenght is that it ?  Tks for the unicode version. I´ll test ir right now

JJ...i tried to assemble your file, but masmbasic throw an error of configuration. I think i do have uasm on the same path as in ml.exe, but this is what is being showed to me:



How can i configure masmbasic properly to make it work on this test ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

  • Member
  • *****
  • Posts: 1074
  • Assembly is a state of art.
    • RosAsm
Re: Faster Memcopy ...
« Reply #103 on: November 15, 2019, 03:05:45 PM »
Hi Nidud....The unicode version is working as expected :)

One question. How to implement a security inside the function to see either the string is really unicode or not, while it is calculating the lenght ?

I mean....say i have a bad unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How to make the function checks the bad chars 'hi', and return a value of ... say 0-1 (meaning the function found an error ) ?

The RosAsm porting of your function, is like this:
Code: [Select]

Proc UnicodeStrlen:
    Arguments @Input
    Uses ecx, edx, edi ; <----preserve ecx, edx, edi registers - I benchmark the speed with all algos preserving the registers to be sure they are all behaving on the same way to see which one is faster etc under the very same conditions.

    mov eax D@Input
    bt eax 0 | jc L3>
    mov ecx D@Input
    and eax 0-16
    and ecx 16-1
    or  edx 0-1
    shl edx cl
    pxor xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    pxor xmm0 xmm0
    and  ecx edx
    jnz  L2>

L1:
    movups  xmm1 X$eax
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz  L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1
    ExitP
L3:
    mov  edx edi
    mov  edi eax
    xor  eax eax
    or   ecx 0-1
    repne scasw
    not  ecx
    dec  ecx
    mov  eax ecx
    mov  edi edx

EndP

[code]
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

  • Member
  • *****
  • Posts: 1074
  • Assembly is a state of art.
    • RosAsm
Re: Faster Memcopy ...
« Reply #104 on: November 15, 2019, 03:26:18 PM »
ABout the Unicode version. JJ, i ported yours to work with Unicode as well, and it seem to work ok, and it is faster on my machine. Can you give a test on your benchmark function to compare the speeds, pls ?

JJ Unicode Version
Code: [Select]
Proc UniStrLenJJ:
    Arguments @Input
    Uses ecx, edx ; <--- preserved register
   
    mov eax D@Input
    mov ecx eax ; much faster than [esp+4]
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax ; <---- Changed to unaligned version on both unicode and ansi version. A bit faster on my machine, and prevent crashing on unaligned strings calculation)
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1

EndP

JJ Ansi version
Code: [Select]
Proc StrLenJJ:
    Arguments @Input
    Uses ecx, edx
   
    mov eax D@Input
    mov ecx eax ; much faster than [esp+4]
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqb xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqb xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx,ecx
    lea eax D$eax+ecx-16
    sub eax D@Input

EndP

On my tests, in terms of speed the Ansi version and Unicode version does not varies that much in speed

Ansi version usign the following string (99 chars...I´m testing odd and even strings, big or tiny as 1 byte only):
[TestAnsi: B$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timming (in nanosecs): 2.95913440552019 ns

Unicode version
[TestUnicode: U$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timming (in nanosecs): 3.46197153137555 ns



Btw... Same question i made for Nidud. How to make the Unicode version checks if the string being calculated is really Unicode or Not and make it returns 0-1, if it finds non-unicode chars while it is calculating the lenght ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com