Author Topic: Faster Memcopy ...  (Read 39909 times)

jj2007

  • Member
  • *****
  • Posts: 10438
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #105 on: November 15, 2019, 08:56:43 PM »
Well, if you feed it with this string:

    align 16
    db 15 dup(0)
err db "error",0

the aligned (first) compare will be:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e | r r o r 0

xmm0 will be:
00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

and there will be no jump to L2

Correct :thumbsup:

ABout the Unicode version. JJ, i ported yours to work with Unicode as well, and it seem to work ok, and it is faster on my machine. Can you give a test on your benchmark function to compare the speeds, pls ?

If you translate it to Masm syntax, I can do that  :thup:

guga

  • Member
  • *****
  • Posts: 1194
  • Assembly is a state of art.
    • RosAsm
Re: Faster Memcopy ...
« Reply #106 on: November 15, 2019, 09:43:21 PM »
Hii JJ. Tks :)



I believe the porting to masm, is something like this:

Code: [Select]

UniStrLenJJ proc Input :DWORD
             ; <--- uses ebp, rather then esp. Here should have a push ebp | mov  ebp, esp at start. Forgot how to include that. Maybe using invoke token in masm, right ?
    push ecx
    push edx

    mov eax, Input
    mov ecx, eax
    and eax, -16
    and ecx, 16-1
    or edx, -1
    shl edx, cl
    xorps xmm0, xmm0
    pcmpeqw xmm0, [eax] ; <---- or it is  xmmword ptr [eax]. Don´t remember if there is a xmmword instruction in masm. It´s the same as in Algo1, but using pcmpeqw rather then pcmpeqb
    add eax, 16
    pmovmskb ecx, xmm0
    xorps xmm0, xmm0
    and ecx, edx
    jnz short Out1

InnerLoop:
     movups  xmm1, [eax]; <---- or it is  xmmword ptr [eax]. Don´t remember if there is a xmmword instruction in masm
     pcmpeqw xmm1, xmm0
     pmovmskb ecx, xmm1
     add  eax, 16
     test ecx, ecx
     jz short InnerLoop

Out1:
     bsf ecx, ecx
     lea eax, [ecx+eax-16]
     sub eax, Input
     shr eax, 1

     pop edx
     pop ecx
     retn 4 <----- or a simple ret..Don´t remember the syntax. before the ret should have a mov esp, ebp | pop ebp instructions
UniStrLenJJ     endp
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

nidud

  • Member
  • *****
  • Posts: 1972
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #107 on: November 16, 2019, 12:15:47 AM »
Nidud, about the memory issue...Indeed you are right. I was confused thinking the algo would write something to a protected area, but, it is not.

Alignment of data is normally used to increase performance but in this case it also prevent the function from reading (ahead) into protected memory, so if you remove the alignment the function will blow up.

Quote
It´s only computing the difference where to start calculating the lenght is that it ?

In addition to increase performance yes. Consider the "hello" string at the end of the buffer moved/compared to xmm0. The valid bytes in the aligned block is A to F.

 0 0 0 0 0 0 0 0 0 0 h e l l o 0|p r o t e c t e d 0
                    [0 1 2 3 4 5 6 7 8 9 A B C D E F]
[0 1 2 3 4 5 6 7 8 9 A B C D E F]

One question. How to implement a security inside the function to see either the string is really unicode or not, while it is calculating the lenght ?

I mean....say i have a bad unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How to make the function checks the bad chars 'hi', and return a value of ... say 0-1 (meaning the function found an error ) ?

This is not something you do in an optimized string routine like strlen so don't worry about that.

Quote
The RosAsm porting of your function, is like this:

I don't know how RoAsm implement prologue code but I assume a stack frame is created so this adds code on top which will destroy the alignment. The algorithm is designed as a bare bone modular construct without any additional code added. Any changes above L1 should therefore (at best) adapt to this logic. A list file is helpful to view how this plays out:
00000000  8B442404                  mov     eax,[esp+4]
00000004  0FBAE000                  bt      eax,0
00000008  7246                      jc      L3
0000000A  8B4C2404                  mov     ecx,[esp+4]
0000000E  83E0F0                    and     eax,-16
00000011  83E10F                    and     ecx,16-1
00000014  83CAFF                    or      edx,-1
00000017  D3E2                      shl     edx,cl
00000019  660FEFC0                  pxor    xmm0,xmm0
0000001D  660F7500                  pcmpeqw xmm0,[eax]
00000021  83C010                    add     eax,16
00000024  660FD7C8                  pmovmskb ecx,xmm0
00000028  660FEFC0                  pxor    xmm0,xmm0
0000002C  23CA                      and     ecx,edx
0000002E  7512                      jnz     L2
00000030                        L1:

If you add code to the mix the alignment breaks.

00000000  55                        push    ebp
00000001  8BEC                      mov     ebp,esp
...
00000033                        L1:

The code above the loop is not that important speed-wise so you basically selected instructions by size for aligning the loop below (L1).

00000000  55                        push    ebp
00000001  8BEC                      mov     ebp,esp
00000003  8B4508                    mov     eax,[ebp+8] ; - one byte
00000006  0FBAE000                  bt      eax,0
0000000A  7244                      jc      L3
0000000C  8BC8                      mov     ecx,eax     ; - two bytes
0000000E  83E0F0                    and     eax,-16
00000011  83E10F                    and     ecx,16-1
00000014  83CAFF                    or      edx,-1
00000017  D3E2                      shl     edx,cl
00000019  660FEFC0                  pxor    xmm0,xmm0
0000001D  660F7500                  pcmpeqw xmm0,[eax]
00000021  83C010                    add     eax,16
00000024  660FD7C8                  pmovmskb ecx,xmm0
00000028  660FEFC0                  pxor    xmm0,xmm0
0000002C  23CA                      and     ecx,edx
0000002E  7512                      jnz     L2
00000030                        L1:

Quote
Proc UnicodeStrlen:
    Arguments @Input
    Uses ecx, edx, edi ; <----preserve ecx, edx, edi registers - I benchmark the speed with all algos preserving the registers to be sure they are all behaving on the same way to see which one is faster etc under the very same conditions.

Well, preserving registers here is bad advised (and edi is not even used) but it is possible to save two bytes (make room for two pushes) by flipping PXOR to XORPS. This approach may be more reliable than using a benchmark to adjust these things.

AW

  • Member
  • *****
  • Posts: 2583
  • Let's Make ASM Great Again!
Re: Faster Memcopy ...
« Reply #108 on: November 16, 2019, 01:12:02 AM »
This performs better with small strings:

Code: [Select]
    .686
    .xmm
    .model flat

    .code
    mov     eax,-16
    mov     edx,[esp+4]
xorps xmm0, xmm0
@loop:
    add eax, 16
    PCMPISTRI xmm0, xmmword ptr [edx + eax], 1000b
    jnz @loop
    add eax, ecx
    ret

    end

total [0 .. 40], 8++
   262586 cycles 5.asm: PCMPISTRI
   329094 cycles 3.asm: SSE Intel Silvermont
   366495 cycles 1.asm: SSE 16
   453912 cycles 2.asm: SSE 32
   682335 cycles 0.asm: msvcrt.strlen()
   728151 cycles 4.asm: SSE Intel Atom
hit any key to continue...

(*) Note: can read up to 15 characters past end of string and cause an exception in some cases, normally cross pages.

nidud

  • Member
  • *****
  • Posts: 1972
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #109 on: November 16, 2019, 02:39:42 AM »
64-bit version of wcslen:

    .code

wcslen::

    test    ecx,1
    jnz     L3
    mov     r8,rcx
    mov     rax,rcx
    and     rax,-32
    and     ecx,32-1
    mov     edx,-1
    shl     edx,cl
    pxor    xmm0,xmm0
    vpcmpeqw ymm1,ymm0,[rax]
    add     rax,32
    vpmovmskb ecx,ymm1
    and     ecx,edx
    jnz     L2
L1:
    vpcmpeqw ymm1,ymm0,[rax]
    vpmovmskb ecx,ymm1
    add     rax,32
    test    ecx,ecx
    jz      L1
L2:
    bsf     ecx,ecx
    lea     rax,[rax+rcx-32]
    sub     rax,r8
    shr     rax,1
    ret
L3:
    push    rdi
    mov     rdi,rcx
    xor     eax,eax
    mov     rcx,-1
    repne   scasw
    not     rcx
    dec     rcx
    mov     rax,rcx
    pop     rdi
    ret

    end

bench:

total [0 .. 40], 8++
   338995 cycles 3.asm: AVX 32
   509346 cycles 2.asm: SSE 16
  1565427 cycles 0.asm: msvcrt.wcslen()
  4166297 cycles 1.asm: scasw
total [41 .. 80], 7++
   311340 cycles 3.asm: AVX 32
   605787 cycles 2.asm: SSE 16
  2549266 cycles 0.asm: msvcrt.wcslen()
  4701738 cycles 1.asm: scasw
total [600 .. 1000], 100++
   425699 cycles 3.asm: AVX 32
  1081618 cycles 2.asm: SSE 16
  3821484 cycles 0.asm: msvcrt.wcslen()
  7475438 cycles 1.asm: scasw

jj2007

  • Member
  • *****
  • Posts: 10438
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #110 on: November 16, 2019, 03:47:45 AM »
Testbed with Guga's Unicode version, using ebp:

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

7047    cycles for 100 * CRT strlen
5265    cycles for 100 * Masm32 StrLen
19964   cycles for 100 * Windows lstrlen
1711    cycles for 100 * MasmBasic Len
1499    cycles for 100 * Algo1
2918    cycles for 100 * AlgoGuga

7097    cycles for 100 * CRT strlen
5311    cycles for 100 * Masm32 StrLen
19940   cycles for 100 * Windows lstrlen
1704    cycles for 100 * MasmBasic Len
1497    cycles for 100 * Algo1
2929    cycles for 100 * AlgoGuga

7030    cycles for 100 * CRT strlen
5250    cycles for 100 * Masm32 StrLen
19956   cycles for 100 * Windows lstrlen
1674    cycles for 100 * MasmBasic Len
1494    cycles for 100 * Algo1
2935    cycles for 100 * AlgoGuga

7057    cycles for 100 * CRT strlen
5268    cycles for 100 * Masm32 StrLen
20014   cycles for 100 * Windows lstrlen
1714    cycles for 100 * MasmBasic Len
1511    cycles for 100 * Algo1
3000    cycles for 100 * AlgoGuga

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
74      bytes for Algo1
87      bytes for AlgoGuga

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1
100     = eax AlgoGuga

AW

  • Member
  • *****
  • Posts: 2583
  • Let's Make ASM Great Again!
Re: Faster Memcopy ...
« Reply #111 on: November 16, 2019, 05:20:30 AM »
This is the Unicode version using PCMPISTRI

Code: [Select]
    .code
    mov     rax,-16
    mov     rdx, rcx
    xorps xmm0, xmm0
@loop:
    add rax, 16
    PCMPISTRI xmm0, xmmword ptr [rdx + rax], 1001b
    jnz @loop
    shr eax, 1
    add eax, ecx
    ret
    end

total [0 .. 40], 8++
   323358 cycles 3.asm: AVX 32
   404277 cycles 4.asm: PCMPISTRI
   477789 cycles 2.asm: SSE 16
  1237417 cycles 0.asm: msvcrt.wcslen()
  3886924 cycles 1.asm: scasw

  total [41 .. 80], 7++
   291655 cycles 3.asm: AVX 32
   562089 cycles 2.asm: SSE 16
   563122 cycles 4.asm: PCMPISTRI
  1935096 cycles 0.asm: msvcrt.wcslen()
  4320489 cycles 1.asm: scasw
 
  total [600 .. 1000], 100++
   333669 cycles 3.asm: AVX 32
   982307 cycles 2.asm: SSE 16
  1405725 cycles 4.asm: PCMPISTRI
  3490272 cycles 0.asm: msvcrt.wcslen()
  6914474 cycles 1.asm: scasw

nidud

  • Member
  • *****
  • Posts: 1972
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #112 on: November 16, 2019, 06:30:24 AM »
Aligned version

    .code
    mov     r8,rcx
    mov     rax,rcx
    and     rax,-16
    and     rcx,16-1
    mov     edx,-1
    shl     edx,cl
    pxor    xmm0,xmm0
    pcmpeqw xmm0,[rax]
    add     rax,16
    pmovmskb ecx,xmm0
    pxor    xmm0,xmm0
    and     ecx,edx
    bsf     ecx,ecx
    jnz     L2
L1:
    pcmpistri xmm0,[rax],9
    lea     rax,[rax+16]
    jnz     L1
L2:
    lea     rax,[rax+rcx-16]
    sub     rax,r8
    shr     rax,1
    ret
    end

Similar results

total [0 .. 40], 8++
   339355 cycles 3.asm: AVX 32
   468741 cycles 4.asm: SSE4_2 PCMPISTRI
   508062 cycles 2.asm: SSE 16
  1618825 cycles 0.asm: msvcrt.wcslen()
  4159568 cycles 1.asm: scasw

total [41 .. 80], 7++
   314024 cycles 3.asm: AVX 32
   609103 cycles 2.asm: SSE 16
   622895 cycles 4.asm: SSE4_2 PCMPISTRI
  2544738 cycles 0.asm: msvcrt.wcslen()
  4694525 cycles 1.asm: scasw

total [600 .. 1000], 100++
   427188 cycles 3.asm: AVX 32
  1081861 cycles 2.asm: SSE 16
  1558990 cycles 4.asm: SSE4_2 PCMPISTRI
  3815976 cycles 0.asm: msvcrt.wcslen()
  7475706 cycles 1.asm: scasw

guga

  • Member
  • *****
  • Posts: 1194
  • Assembly is a state of art.
    • RosAsm
Re: Faster Memcopy ...
« Reply #113 on: November 16, 2019, 10:23:57 AM »
Tks a lot, JJ :)

It seems is fast as expected.
Here is the results:

Code: [Select]
AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

5842 cycles for 100 * CRT strlen
5813 cycles for 100 * Masm32 StrLen
18946 cycles for 100 * Windows lstrlen
1878 cycles for 100 * MasmBasic Len
1545 cycles for 100 * Algo1
2579 cycles for 100 * AlgoGuga

5769 cycles for 100 * CRT strlen
5711 cycles for 100 * Masm32 StrLen
18825 cycles for 100 * Windows lstrlen
2350 cycles for 100 * MasmBasic Len
1917 cycles for 100 * Algo1
3390 cycles for 100 * AlgoGuga

7980 cycles for 100 * CRT strlen
7723 cycles for 100 * Masm32 StrLen
24159 cycles for 100 * Windows lstrlen
2372 cycles for 100 * MasmBasic Len
1930 cycles for 100 * Algo1
2587 cycles for 100 * AlgoGuga

6088 cycles for 100 * CRT strlen
7132 cycles for 100 * Masm32 StrLen
22808 cycles for 100 * Windows lstrlen
1673 cycles for 100 * MasmBasic Len
1609 cycles for 100 * Algo1
2636 cycles for 100 * AlgoGuga

14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
74 bytes for Algo1
87 bytes for AlgoGuga

100 = eax CRT strlen
100 = eax Masm32 StrLen
100 = eax Windows lstrlen
100 = eax MasmBasic Len
100 = eax Algo1
100 = eax AlgoGuga

--- ok ---


Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

  • Member
  • *****
  • Posts: 10438
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #114 on: November 16, 2019, 03:36:03 PM »
Tks a lot, JJ :)

It seems is fast as expected.

Thanks, Guga. Here is a new version, 59 bytes short and pretty fast:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6984    cycles for 100 * CRT strlen
5216    cycles for 100 * Masm32 StrLen
19898   cycles for 100 * Windows lstrlen
1674    cycles for 100 * MasmBasic Len
1500    cycles for 100 * Algo1/Nidud
2608    cycles for 100 * Algo1/Guga+JJ

7035    cycles for 100 * CRT strlen
5249    cycles for 100 * Masm32 StrLen
19929   cycles for 100 * Windows lstrlen
1700    cycles for 100 * MasmBasic Len
1569    cycles for 100 * Algo1/Nidud
2625    cycles for 100 * Algo1/Guga+JJ

7087    cycles for 100 * CRT strlen
5215    cycles for 100 * Masm32 StrLen
19859   cycles for 100 * Windows lstrlen
1701    cycles for 100 * MasmBasic Len
1498    cycles for 100 * Algo1/Nidud
2644    cycles for 100 * Algo1/Guga+JJ

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
53      bytes for Algo1/Nidud
59      bytes for Algo1/Guga+JJ

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1/Nidud
100     = eax Algo1/Guga+JJ

Note that results are not fully comparable. Masm32 len(), MasmBasic Len() and Algo1/Guga+JJ preserve edx and ecx. This is unusual, but the good ol' Masm32 len() algo did this, so in order to avoid compatibility problems, Len() behaves the same. In addition, Len() preserves xmm0. Still, Len() is over 4 times faster than CRT strlen :cool:

daydreamer

  • Member
  • *****
  • Posts: 1310
  • building nextdoor
Re: Faster Memcopy ...
« Reply #115 on: November 16, 2019, 08:04:12 PM »
Hi Nidud....The unicode version is working as expected :)

One question. How to implement a security inside the function to see either the string is really unicode or not, while it is calculating the lenght ?

I mean....say i have a bad unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How to make the function checks the bad chars 'hi', and return a value of ... say 0-1 (meaning the function found an error ) ?


in my experience when using a texteditor and start to put in unicode characters,the texteditor detects I have entered or copy/pasted unicode and it asks if  I want to save it unicode format or choose to lose that info saving in good old ascii,my suggestion is support a unicode or not flag somewhere in the copy routine,so if its used by a texteditor/gui creator,puts code to change unicode flag as soon as it detects unicode usage by user,maybe add superfast routine that expands ascii-to unicode when that happens?
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D
I love assembly,because its legal to write
princess:lea eax,luke
:)

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 7418
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Faster Memcopy ...
« Reply #116 on: November 16, 2019, 09:14:42 PM »
Just write a UNICODE editor, no conversions.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

AW

  • Member
  • *****
  • Posts: 2583
  • Let's Make ASM Great Again!
Re: Faster Memcopy ...
« Reply #117 on: November 17, 2019, 01:18:53 AM »
Visual Studio and some text editors like Notepad++ detect Unicode on save.
But in this thread, what we have been calling Unicode are word length characters, that is the UTF-16 character set without the additional planes for emoji and other stuff. This is what Windows internals use. For programming and for most purposes it is very OK, no need to complicate things further. However there is more to it.

Windows has the IsTextUnicode API function but it sometimes fails. I have seen and used better algos but am never sure it will not fail somewhere.




guga

  • Member
  • *****
  • Posts: 1194
  • Assembly is a state of art.
    • RosAsm
Re: Faster Memcopy ...
« Reply #118 on: November 17, 2019, 07:36:08 AM »
Just write a UNICODE editor, no conversions.

Hi Steve, tks...But i was thinking on a function able to detect it for disassembler purposes. While i was testing the GDI issues of that other post, i faced a really old problem on RosAsm and decided start a minor update on some of the routines. The routine i´m currently working on is the disassembler that also uses routines to check chunk of bytes. I remember using a very very old function for string lenght and updated it, but it was not fast enough, that´s why i decided to give a try on these.

Since RosAsm (now) do disassemble unicode strings properly, i´m trying also a unicode string lenght check and a routine to automatically detect simple unicode strings (Char+0, Char+0 etc). So, for a disassembler point of view, a faster routine able to determine whether a string is unicode or ansi (even simple unicode format) maybe necessary to avoid bad recognitions.

I succeeded yesterday to fix  a old bug in RosAsm disassembler that caused the dissassemblement process to be kind of slow in while it is working in some apps or dlls. For example, on shell32.dll or atioglxx.dll for Xp (12Mb and 17 Mb respectively) , the disassembler was taking almost 25 minutes to finish due to heavily computation of code/data identification and the main functions were going back and forward thousands of times. It was a stupid design mistake i (or rené..i don´t remember) did years ago, forcing one of the functions to allocate and deallocate a huge amount of memory every time the routine is used (and it is used thousands on times for problematic files) .  Now, the process is way fast for some problematic files. The same test i made on shell32.dll disassembled it in about 20 seconds and kept the accuracy of the data that was disassembled)

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 7418
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Faster Memcopy ...
« Reply #119 on: November 17, 2019, 09:42:44 AM »
As long as you can safely identify character data, either ANSI or UNICODE, it may be viable to have a manual option to switch between the two. There is no truly reliable method of detecting the difference so you use the best you can get/write and have a manual option.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy: