Faster Memcopy ...

Started by rrr314159, March 03, 2015, 02:40:50 PM

jj2007

Quote from: nidud on November 15, 2019, 12:31:13 PM
Well, if you feed it with this string:

    align 16
    db 15 dup(0)
err db "error",0

the aligned (first) compare will be:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e | r r o r 0

xmm0 will be:
00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

and there will be no jump to L2

Correct :thumbsup:

Quote from: guga on November 15, 2019, 03:26:18 PM
About the Unicode version: JJ, I ported yours to work with Unicode as well, and it seems to work OK, and it is faster on my machine. Can you test it with your benchmark function to compare the speeds, please?

If you translate it to Masm syntax, I can do that  :thup:

guga

Hi JJ. Thanks :)

I believe the port to MASM is something like this:

UniStrLenJJ proc Input :DWORD
             ; <--- uses ebp rather than esp. MASM emits push ebp | mov ebp, esp here automatically, because the PROC declares a parameter.
    push ecx
    push edx

    mov eax, Input
    mov ecx, eax
    and eax, -16
    and ecx, 16-1
    or edx, -1
    shl edx, cl
    xorps xmm0, xmm0
    pcmpeqw xmm0, [eax] ; <---- xmmword ptr [eax] also works in MASM. It's the same as in Algo1, but using pcmpeqw rather than pcmpeqb
    add eax, 16
    pmovmskb ecx, xmm0
    xorps xmm0, xmm0
    and ecx, edx
    jnz short Out1

InnerLoop:
     movups  xmm1, [eax] ; <---- xmmword ptr [eax] also works in MASM
     pcmpeqw xmm1, xmm0
     pmovmskb ecx, xmm1
     add  eax, 16
     test ecx, ecx
     jz short InnerLoop

Out1:
     bsf ecx, ecx
     lea eax, [ecx+eax-16]
     sub eax, Input
     shr eax, 1

     pop edx
     pop ecx
     ret ; <----- a plain ret is enough: inside a PROC with a parameter, MASM expands it to the epilogue (mov esp, ebp | pop ebp) followed by ret 4 under stdcall
UniStrLenJJ     endp
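
A scalar C model can help when checking a port like this against test strings. The sketch below mirrors the steps of the routine above (round the address down to a multiple of 16, mask off the bytes in front of the string, scan 16-byte blocks for a zero word, divide the byte offset by 2); it is not the SSE code itself. The names `zero_word_mask` and `uni_strlen_model` are made up for illustration, and the string is addressed as an offset into a buffer so that the model never reads outside it.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar stand-in for pcmpeqw + pmovmskb on one 16-byte block:
   two mask bits are set for every 16-bit lane that equals zero. */
unsigned zero_word_mask(const unsigned char *block)
{
    unsigned mask = 0;
    for (int lane = 0; lane < 8; lane++) {
        if (block[2 * lane] == 0 && block[2 * lane + 1] == 0)
            mask |= 3u << (2 * lane);
    }
    return mask;
}

/* Hypothetical model: buf stands for 16-byte-aligned memory, start is the
   byte offset of the string inside it. The caller must guarantee that a
   zero word terminates the string inside buf. */
size_t uni_strlen_model(const unsigned char *buf, size_t start)
{
    assert((start & 1) == 0);                     /* word compare assumes an even address */
    size_t block = start & ~(size_t)15;           /* and eax, -16 */
    unsigned keep = 0xFFFFFFFFu << (start & 15);  /* or edx, -1 | shl edx, cl */
    unsigned mask = zero_word_mask(buf + block) & keep;

    while (mask == 0) {                           /* InnerLoop */
        block += 16;
        mask = zero_word_mask(buf + block);
    }

    unsigned bit = 0;                             /* bsf ecx, ecx */
    while (((mask >> bit) & 1u) == 0)
        bit++;

    return (block + bit - start) / 2;             /* byte offset -> characters */
}
```

With 14 zero bytes in front of a UTF-16 "error", the masked first block reports nothing and the model returns 5, which is the case the `shl edx, cl` mask exists for.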
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

aw27

This performs better with small strings:


    .686
    .xmm
    .model flat

    .code
    mov     eax,-16
    mov     edx,[esp+4]
    xorps   xmm0, xmm0
@loop:
    add eax, 16
    PCMPISTRI xmm0, xmmword ptr [edx + eax], 1000b
    jnz @loop
    add eax, ecx
    ret

    end


total [0 .. 40], 8++
   262586 cycles 5.asm: PCMPISTRI
   329094 cycles 3.asm: SSE Intel Silvermont
   366495 cycles 1.asm: SSE 16
   453912 cycles 2.asm: SSE 32
   682335 cycles 0.asm: msvcrt.strlen()
   728151 cycles 4.asm: SSE Intel Atom
hit any key to continue...

Note: this can read up to 15 bytes past the end of the string and cause an exception in some cases, normally when the read crosses into the next page.
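
The page-crossing condition can be written down as a small check. This is only a sketch: it assumes 4 KiB pages, and `PAGE_SIZE_ASSUMED`, `load16_crosses_page` and `align_down16` are illustrative names, not part of any of the routines benchmarked here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u   /* assumption: 4 KiB pages */

static_assert((PAGE_SIZE_ASSUMED & (PAGE_SIZE_ASSUMED - 1)) == 0,
              "page size must be a power of two");

/* True if a 16-byte load starting at addr touches the following page. */
bool load16_crosses_page(uintptr_t addr)
{
    return (addr & (PAGE_SIZE_ASSUMED - 1)) > PAGE_SIZE_ASSUMED - 16;
}

/* Round an address down to a multiple of 16, as the SSE versions in this
   thread do with "and eax, -16" before their first load. */
uintptr_t align_down16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}
```

A 16-byte load from a 16-byte-aligned address never crosses a page, so `load16_crosses_page(align_down16(a))` is false for every `a`. The unaligned PCMPISTRI loop has no such guarantee once the string ends within the last 15 bytes of a page.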

jj2007

Testbed with Guga's Unicode version, using ebp:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

7047    cycles for 100 * CRT strlen
5265    cycles for 100 * Masm32 StrLen
19964   cycles for 100 * Windows lstrlen
1711    cycles for 100 * MasmBasic Len
1499    cycles for 100 * Algo1
2918    cycles for 100 * AlgoGuga

7097    cycles for 100 * CRT strlen
5311    cycles for 100 * Masm32 StrLen
19940   cycles for 100 * Windows lstrlen
1704    cycles for 100 * MasmBasic Len
1497    cycles for 100 * Algo1
2929    cycles for 100 * AlgoGuga

7030    cycles for 100 * CRT strlen
5250    cycles for 100 * Masm32 StrLen
19956   cycles for 100 * Windows lstrlen
1674    cycles for 100 * MasmBasic Len
1494    cycles for 100 * Algo1
2935    cycles for 100 * AlgoGuga

7057    cycles for 100 * CRT strlen
5268    cycles for 100 * Masm32 StrLen
20014   cycles for 100 * Windows lstrlen
1714    cycles for 100 * MasmBasic Len
1511    cycles for 100 * Algo1
3000    cycles for 100 * AlgoGuga

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
74      bytes for Algo1
87      bytes for AlgoGuga

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1
100     = eax AlgoGuga

aw27

This is the Unicode version using PCMPISTRI


    .code
    mov     rax,-16
    mov     rdx, rcx
    xorps xmm0, xmm0
@loop:
    add rax, 16
    PCMPISTRI xmm0, xmmword ptr [rdx + rax], 1001b
    jnz @loop
    shr eax, 1
    add eax, ecx
    ret
    end


total [0 .. 40], 8++
   323358 cycles 3.asm: AVX 32
   404277 cycles 4.asm: PCMPISTRI
   477789 cycles 2.asm: SSE 16
  1237417 cycles 0.asm: msvcrt.wcslen()
  3886924 cycles 1.asm: scasw

  total [41 .. 80], 7++
   291655 cycles 3.asm: AVX 32
   562089 cycles 2.asm: SSE 16
   563122 cycles 4.asm: PCMPISTRI
  1935096 cycles 0.asm: msvcrt.wcslen()
  4320489 cycles 1.asm: scasw
 
  total [600 .. 1000], 100++
   333669 cycles 3.asm: AVX 32
   982307 cycles 2.asm: SSE 16
  1405725 cycles 4.asm: PCMPISTRI
  3490272 cycles 0.asm: msvcrt.wcslen()
  6914474 cycles 1.asm: scasw

guga

Thanks a lot, JJ :)

It seems as fast as expected.
Here are the results:


AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

5842 cycles for 100 * CRT strlen
5813 cycles for 100 * Masm32 StrLen
18946 cycles for 100 * Windows lstrlen
1878 cycles for 100 * MasmBasic Len
1545 cycles for 100 * Algo1
2579 cycles for 100 * AlgoGuga

5769 cycles for 100 * CRT strlen
5711 cycles for 100 * Masm32 StrLen
18825 cycles for 100 * Windows lstrlen
2350 cycles for 100 * MasmBasic Len
1917 cycles for 100 * Algo1
3390 cycles for 100 * AlgoGuga

7980 cycles for 100 * CRT strlen
7723 cycles for 100 * Masm32 StrLen
24159 cycles for 100 * Windows lstrlen
2372 cycles for 100 * MasmBasic Len
1930 cycles for 100 * Algo1
2587 cycles for 100 * AlgoGuga

6088 cycles for 100 * CRT strlen
7132 cycles for 100 * Masm32 StrLen
22808 cycles for 100 * Windows lstrlen
1673 cycles for 100 * MasmBasic Len
1609 cycles for 100 * Algo1
2636 cycles for 100 * AlgoGuga

14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
74 bytes for Algo1
87 bytes for AlgoGuga

100 = eax CRT strlen
100 = eax Masm32 StrLen
100 = eax Windows lstrlen
100 = eax MasmBasic Len
100 = eax Algo1
100 = eax AlgoGuga

--- ok ---




jj2007

Quote from: guga on November 16, 2019, 10:23:57 AM
Thanks a lot, JJ :)

It seems as fast as expected.

Thanks, Guga. Here is a new version, 59 bytes short and pretty fast:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6984    cycles for 100 * CRT strlen
5216    cycles for 100 * Masm32 StrLen
19898   cycles for 100 * Windows lstrlen
1674    cycles for 100 * MasmBasic Len
1500    cycles for 100 * Algo1/Nidud
2608    cycles for 100 * Algo1/Guga+JJ

7035    cycles for 100 * CRT strlen
5249    cycles for 100 * Masm32 StrLen
19929   cycles for 100 * Windows lstrlen
1700    cycles for 100 * MasmBasic Len
1569    cycles for 100 * Algo1/Nidud
2625    cycles for 100 * Algo1/Guga+JJ

7087    cycles for 100 * CRT strlen
5215    cycles for 100 * Masm32 StrLen
19859   cycles for 100 * Windows lstrlen
1701    cycles for 100 * MasmBasic Len
1498    cycles for 100 * Algo1/Nidud
2644    cycles for 100 * Algo1/Guga+JJ

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
53      bytes for Algo1/Nidud
59      bytes for Algo1/Guga+JJ

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1/Nidud
100     = eax Algo1/Guga+JJ


Note that results are not fully comparable. Masm32 len(), MasmBasic Len() and Algo1/Guga+JJ preserve edx and ecx. This is unusual, but the good ol' Masm32 len() algo did this, so in order to avoid compatibility problems, Len() behaves the same. In addition, Len() preserves xmm0. Still, Len() is over 4 times faster than CRT strlen :cool:

daydreamer

Quote from: guga on November 15, 2019, 03:05:45 PM
Hi Nidud... The Unicode version is working as expected :)

One question: how can I implement a safety check inside the function to see whether the string really is Unicode, while it is calculating the length?

I mean, say I have a bad Unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How can I make the function check for the bad chars 'hi' and return a value of, say, 0-1 (meaning the function found an error)?


In my experience, when I use a text editor and start to put in Unicode characters, the editor detects that I have entered or copy/pasted Unicode and asks whether I want to save in Unicode format or lose that information by saving in good old ASCII. My suggestion is to support a Unicode-or-not flag somewhere in the copy routine, so that a text editor or GUI creator using it can change the flag as soon as it detects Unicode usage by the user. Maybe add a superfast routine that expands ASCII to Unicode when that happens?
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

hutch--

Just write a UNICODE editor, no conversions.

aw27

Visual Studio and some text editors like Notepad++ detect Unicode on save.
But in this thread, what we have been calling Unicode are word-length characters, that is, the UTF-16 character set without the additional planes for emoji and other stuff. This is what Windows internals use. For programming and for most purposes that is perfectly OK; there is no need to complicate things further. However, there is more to it.

Windows has the IsTextUnicode API function but it sometimes fails. I have seen and used better algos but am never sure it will not fail somewhere.




guga

Quote from: hutch-- on November 16, 2019, 09:14:42 PM
Just write a UNICODE editor, no conversions.

Hi Steve, thanks. But I was thinking of a function able to detect it for disassembler purposes. While I was testing the GDI issues from that other post, I ran into a really old problem in RosAsm and decided to start a minor update of some of the routines. The routine I'm currently working on is the disassembler, which also uses routines to check chunks of bytes. I remember using a very, very old function for string length and updating it, but it was not fast enough; that's why I decided to give these a try.

Since RosAsm (now) disassembles Unicode strings properly, I'm also trying a Unicode string length check and a routine to automatically detect simple Unicode strings (Char+0, Char+0, etc.). So, from a disassembler's point of view, a faster routine able to determine whether a string is Unicode or ANSI (even in simple Unicode format) may be necessary to avoid bad recognitions.
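
A check for the "simple Unicode" pattern (Char+0, Char+0, ...) could look roughly like the C sketch below. It is a guess at the kind of test meant here, not RosAsm's actual routine: `simple_unicode_len` is a hypothetical name, and the rule "low byte nonzero, high byte zero" deliberately rejects real UTF-16 text whose characters have a nonzero high byte.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length in characters of a "simple" UTF-16LE string
   (every character is a nonzero byte followed by a zero byte, and the
   string ends with a zero word). Returns -1 if a pair breaks that
   pattern or if no terminator is found within max_bytes. */
long simple_unicode_len(const unsigned char *p, size_t max_bytes)
{
    assert(p != NULL);
    for (size_t i = 0; i + 1 < max_bytes; i += 2) {
        unsigned char lo = p[i];
        unsigned char hi = p[i + 1];
        if (lo == 0 && hi == 0)
            return (long)(i / 2);   /* zero word: end of string */
        if (lo == 0 || hi != 0)
            return -1;              /* pair is not Char+0 */
    }
    return -1;                      /* ran out of buffer without a terminator */
}
```

Assuming the bad string quoted earlier in the thread assembles to the bytes 'H', 0, 'e', 0, 'h', 'i', 0, the third pair is ('h', 'i'), so the function returns -1, which is the "0-1" error value asked for there.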

Yesterday I succeeded in fixing an old bug in the RosAsm disassembler that made the disassembly process rather slow on some apps or DLLs. For example, on shell32.dll or atioglxx.dll for XP (12 MB and 17 MB respectively), the disassembler was taking almost 25 minutes to finish because of the heavy computation for code/data identification, with the main functions going back and forth thousands of times. It was a stupid design mistake I (or René, I don't remember) made years ago, forcing one of the functions to allocate and deallocate a huge amount of memory every time the routine is used (and it is used thousands of times for problematic files). Now the process is much faster for some problematic files. In the same test, shell32.dll was disassembled in about 20 seconds, with the accuracy of the disassembled data preserved.


hutch--

As long as you can safely identify character data, either ANSI or UNICODE, it may be viable to have a manual option to switch between the two. There is no truly reliable method of detecting the difference so you use the best you can get/write and have a manual option.