Author Topic: Faster Memcopy ...  (Read 41221 times)

guga

  • Member
  • *****
  • Posts: 1285
  • Assembly is a state of art.
    • RosAsm
Re: Faster Memcopy ...
« Reply #120 on: November 17, 2019, 11:50:24 AM »
A manual option? :thumbsup: :thumbsup: :thumbsup: Indeed. Yeah... maybe I'll implement it later, once I finish the fixes. I plan to add a couple of user options as soon as I have some time to dedicate to RosAsm updates again. The current version identifies ANSI and Unicode strings with reasonable accuracy, but it still has some minor problems, because in some cases a chunk of data can be either a string or a pointer. That is not easy to fix without the help of other tools, such as a signature system I started developing years ago but never finished.
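
To illustrate why a chunk can be both a string and a pointer, here is a minimal C sketch (C rather than assembly, only so it is easy to compile and test anywhere). The helper names, the image base 0x00400000 and the image size are assumptions made up for this example; they are not taken from RosAsm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if buf starts with at least min_len printable
   ASCII bytes that are followed by a NUL terminator within the first n bytes. */
int looks_like_ansi_string(const uint8_t *buf, size_t n, size_t min_len)
{
    assert(buf != NULL);
    size_t i = 0;
    while (i < n && buf[i] >= 0x20 && buf[i] < 0x7F)
        i++;
    return i >= min_len && i < n && buf[i] == 0;
}

/* Hypothetical helper: true if the little-endian dword at buf is an address
   inside the loaded image [image_base, image_base + image_size). */
int looks_like_pointer(const uint8_t *buf, uint32_t image_base,
                       uint32_t image_size)
{
    assert(buf != NULL);
    uint32_t v = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
                 ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    return v >= image_base && v - image_base < image_size;
}
```

With the bytes 20 30 40 00 both checks succeed: the chunk is the C string " 0@" and also the little-endian address 0x00403020, which lies inside the assumed image. Deciding which reading is right needs outside information, such as references from code or a signature database.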

Manual options could work as in IdaPro, letting the user choose whether to disassemble C-style strings, DOS-style, Pascal-style, Delphi strings, etc. But for those specific cases (strings tied to certain compilers), it would be better to do it when (or if) I manage to finish the signature technique. I called it DIS (Digital Identification System) many years ago.

The current routine does the job more than 90% of the time for real apps. The disassembler flags certain areas of the PE as forbidden so they are not disassembled, for example the areas flagged as import, export, resources, etc. It basically identifies the good code and data sections, and those are the ones that actually get disassembled.

One of the problems is that it is common to find embedded data, structures, strings, etc. inside the code section. Although the disassembler works reasonably well on those sections, some files do produce wrong results... but I'll fix those later, once I fix some minor bugs in RosAsm.

I'm comparing the results of the fixes I'm currently doing, and they are more accurate than the ones in IdaPro (not considering the Flirt system used in Ida, of course), but there are still some issues.

Eventually I'll try to create a set of macros or internal routines to allow compiling MASM-style syntax, but... that will still take more time.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

daydreamer

  • Member
  • *****
  • Posts: 1360
  • building nextdoor
Re: Faster Memcopy ...
« Reply #121 on: November 17, 2019, 08:50:52 PM »
Just write a UNICODE editor, no conversions.
I have already worked with a Unicode RichEdit, but as an asm programmer I cannot resist downsizing tricks, like storing the text data at half the size and upsizing it again by adding the start of the specific language block it belongs to, then a mov to the character buffer.
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D
I love assembly,because its legal to write
princess:lea eax,luke
:)

daydreamer

Re: Faster Memcopy ...
« Reply #122 on: November 17, 2019, 09:12:38 PM »
Quote
Manual options could work as in IdaPro, letting the user choose whether to disassemble C-style strings, DOS-style, Pascal-style, Delphi strings, etc. But for those specific cases (strings tied to certain compilers), it would be better to do it when (or if) I manage to finish the signature technique. I called it DIS (Digital Identification System) many years ago.

The current routine does the job more than 90% of the time for real apps. The disassembler flags certain areas of the PE as forbidden so they are not disassembled, for example the areas flagged as import, export, resources, etc. It basically identifies the good code and data sections, and those are the ones that actually get disassembled.

90%, great job :thumbsup: Maybe some of the failures happen because the author of the code tried to make it harder to disassemble?
Why not do it like a smartphone editor: show the best guess for the string format together with a tiny sample, and let the user choose from a list of text formats? I had no idea there were so many different text formats.
I can also think of hardcoded mov eax,"abcd" or mov eax,"ab" (Unicode), which can be very hard to handle in a disassembler.
It would also be good to know how to detect that a text file was saved in UTF-16 format.
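
On detecting UTF-16 files: many editors write a byte order mark (BOM) at the start of the file, and checking for it is the cheap first step. Below is a minimal C sketch of that check; the enum and function names are invented for this example, and a file without a BOM simply comes back as unknown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ENC_UNKNOWN,    /* no BOM found: could be ANSI, UTF-8 or BOM-less UTF-16 */
    ENC_UTF8_BOM,   /* EF BB BF */
    ENC_UTF16_LE,   /* FF FE */
    ENC_UTF16_BE    /* FE FF */
} text_encoding;

/* Looks only at the first bytes of the file contents.
   Note: FF FE 00 00 is also the UTF-32 LE BOM; this sketch ignores UTF-32. */
text_encoding detect_bom(const uint8_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    if (n >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return ENC_UTF8_BOM;
    if (n >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return ENC_UTF16_LE;
    if (n >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return ENC_UTF16_BE;
    return ENC_UNKNOWN;
}
```

Files saved without a BOM need a content-based guess instead, which can never be fully reliable.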

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 7542
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Faster Memcopy ...
« Reply #123 on: November 17, 2019, 10:15:24 PM »
Using rich edit, the shift from ANSI to UNICODE is simple enough to do. If you have to work in both, load the file and, if it looks like garbage, switch from one to the other. It just means loading the file again. You can make tangled messes that still can't guarantee being correct, or switch between the two; the latter is much easier.
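
The "looks like garbage" test can be roughly approximated in code. The sketch below is one possible heuristic, assumed for illustration and not what any rich edit control does: in UTF-16LE text that is mostly Latin, the high byte of almost every character is zero, while ANSI text contains almost no zero bytes. The function name and the 50% threshold are invented for this example. Windows also offers the IsTextUnicode API for this kind of guess.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the buffer is more likely UTF-16LE than ANSI.
   Heuristic only: it will misjudge UTF-16 text made mostly of non-Latin
   characters (CJK, for example), where the high bytes are rarely zero. */
int guess_utf16le(const uint8_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    size_t pairs = n / 2;   /* number of complete 16-bit units */
    size_t zero_lo = 0;     /* zero bytes at even offsets (low bytes)  */
    size_t zero_hi = 0;     /* zero bytes at odd offsets  (high bytes) */
    for (size_t i = 0; i < pairs; i++) {
        if (data[2 * i] == 0)
            zero_lo++;
        if (data[2 * i + 1] == 0)
            zero_hi++;
    }
    /* At least half of the high bytes are zero, and clearly more zeros in
       the high bytes than in the low bytes. */
    return pairs > 0 && zero_hi * 2 >= pairs && zero_lo < zero_hi;
}
```

If the guess turns out wrong, the manual switch described above is still the fallback.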
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

AW

  • Member
  • *****
  • Posts: 2583
  • Let's Make ASM Great Again!
Re: Faster Memcopy ...
« Reply #124 on: November 17, 2019, 11:03:31 PM »
This is another one, this time for an AVX-512 strlen (based on this url)

Code:
strlenZMM:
push esi
mov eax, 01010101h
vpbroadcastd zmm2, eax ; broadcast eax to all elements
xor edx, edx ; len = 0
mov eax, 80808080h
vpbroadcastd zmm3, eax
mov esi, [esp+8]
@@:
vmovdqu32 zmm0, ZMMWORD PTR [esi+edx]
vpsubd zmm1, zmm0, zmm2
vpternlogd zmm1, zmm0, zmm3, 32
vptestmd k1, zmm1, zmm1
kmovw eax, k1
movzx eax, ax
test ax, ax
jnz @F
add edx, 64
jmp short @B
@@:
bsf eax, eax
push 32
pop ecx
cmovne ecx, eax
lea esi, dword ptr [esi+ecx*4]
cmp byte ptr [esi+edx], 0
lea eax, dword ptr [edx+ecx*4]
je short @exit
cmp byte ptr [esi+edx+1], 0
jne short @F
inc eax
jmp short @exit
@@:
cmp byte ptr [esi+edx+2], 0
jne short @F
add eax, 2
jmp short @exit
@@:
add eax, 3
@exit:
vzeroupper
pop esi
ret
    end
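
For readers following the routine above: the vpsubd / vpternlogd pair computes, for each dword x, (x - 01010101h) AND (NOT x) AND 80808080h (immediate 32 selects "A and not B and C"), which is nonzero exactly when x contains a zero byte. A scalar C sketch of the same test follows. The function names are invented for this example, the byte indexing assumes a little-endian machine such as x86, and the buffer length is passed in so that no load goes past the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any byte of x is zero. Flag bits above the first zero byte can
   be false positives (borrow propagation); the lowest flag is always exact. */
uint32_t has_zero_byte(uint32_t x)
{
    return (x - 0x01010101u) & ~x & 0x80808080u;
}

/* Index (0..3, lowest byte first) of the first zero byte in x, or -1. */
int first_zero_byte(uint32_t x)
{
    uint32_t m = has_zero_byte(x);
    for (int i = 0; i < 4; i++)
        if (m & (0x80u << (8 * i)))
            return i;
    return -1;
}

/* Length of the string in buf, scanning 4 bytes per step.
   cap is the buffer size and must be a multiple of 4; returns cap if no
   terminator is found. Assumes little-endian byte order. */
size_t strlen_dword(const char *buf, size_t cap)
{
    assert(buf != NULL && cap % 4 == 0);
    for (size_t off = 0; off < cap; off += 4) {
        uint32_t w;
        memcpy(&w, buf + off, sizeof w);    /* alignment-safe 4-byte load */
        int i = first_zero_byte(w);
        if (i >= 0)
            return off + (size_t)i;
    }
    return cap;
}
```

The assembly version does the same thing on 16 dwords per iteration and then checks the bytes of the first flagged dword one by one.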

I added a 4th test, for strings between 40000 and 40900 bytes, to see if the AVX-512 version pulls away from the rest. Well, not really: SSE Intel Silvermont and SSE Intel Atom are up there as well.  :hmmm:

total [0 .. 40], 8++
   290780 cycles 7.asm: sse2
   355355 cycles 5.asm: PCMPISTRI
   412251 cycles 3.asm: SSE Intel Silvermont
   469664 cycles 8.asm: Agner Fog
   502841 cycles 1.asm: SSE 16
   524321 cycles 2.asm: SSE 32
   597335 cycles 9.asm: ZMM AVX512
   865552 cycles 4.asm: SSE Intel Atom
   908227 cycles 6.asm: scasb
   913651 cycles 0.asm: msvcrt.strlen()
   
   
total [41 .. 80], 7++
   270380 cycles 3.asm: SSE Intel Silvermont
   299431 cycles 5.asm: PCMPISTRI
   306940 cycles 7.asm: sse2
   314735 cycles 1.asm: SSE 16
   364536 cycles 9.asm: ZMM AVX512
   380247 cycles 8.asm: Agner Fog
   405156 cycles 2.asm: SSE 32
   639091 cycles 4.asm: SSE Intel Atom
   758265 cycles 6.asm: scasb
   982403 cycles 0.asm: msvcrt.strlen()

total [600 .. 1000], 100++
   202227 cycles 9.asm: ZMM AVX512
   237534 cycles 3.asm: SSE Intel Silvermont
   292854 cycles 4.asm: SSE Intel Atom
   334146 cycles 2.asm: SSE 32
   338568 cycles 1.asm: SSE 16
   356720 cycles 7.asm: sse2
   436840 cycles 8.asm: Agner Fog
   650222 cycles 5.asm: PCMPISTRI
  1438033 cycles 6.asm: scasb
  1830544 cycles 0.asm: msvcrt.strlen()
 
total [40000 .. 40900], 100++
  2161645 cycles 3.asm: SSE Intel Silvermont
  2224521 cycles 4.asm: SSE Intel Atom
  2342704 cycles 9.asm: ZMM AVX512
  3137064 cycles 1.asm: SSE 16
  3465817 cycles 7.asm: sse2
  3514206 cycles 2.asm: SSE 32
  4113016 cycles 8.asm: Agner Fog
  6173622 cycles 5.asm: PCMPISTRI
 13022424 cycles 6.asm: scasb
 16670776 cycles 0.asm: msvcrt.strlen() 

nidud

  • Member
  • *****
  • Posts: 1980
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #125 on: November 18, 2019, 12:53:02 AM »
I have no way of testing this, but an aligned version that reads 64 bytes (a cache line) at a time in 64-bit may look something like this.

    .code

strlen::

    xor             eax,eax
    vpbroadcastq    zmm0,rax

    mov             r8,rcx
    mov             rax,rcx
    and             rax,-64
    and             ecx,64-1
    xor             edx,edx
    dec             rdx   
    shl             rdx,cl
    vpcmpgtb        k1{k2},zmm0,[rax]
    kmovd           ecx,k1
    add             rax,32
    and             ecx,edx
    jnz             L2
    kmovd           ecx,k2
    add             rax,32
    shr             rdx,32
    and             rcx,rdx
    jnz             L2

L1:
    vpcmpgtb        k1{k2},zmm0,[rax]
    kmovd           ecx,k1
    add             rax,32
    test            ecx,ecx
    jnz             L2
    kmovd           ecx,k2
    add             rax,32
    test            ecx,ecx
    jz              L1

L2:
    bsf             ecx,ecx
    lea             rax,[rax+rcx-32]
    sub             rax,r8
    ret

    end

AW

Re: Faster Memcopy ...
« Reply #126 on: November 18, 2019, 05:19:44 AM »
@nidud,

It has a bug. I have not figured out the logic yet, but it never gets past this:
00007ff65266174e  test         ecx, ecx 
00007ff652661750  jz         0x7ff652661734 

You can debug with VS 2012 or 2015 and the Intel SDE Debugger. I don't know if it works with the Express or Community editions; I have the Pro for those years.
Later: I think it does, because I read this:
http://masm32.com/board/index.php?topic=6473.msg69456#msg69456

AW

Re: Faster Memcopy ...
« Reply #127 on: November 18, 2019, 08:07:16 AM »
I understood the logic and corrected it in the following way and it does not need to be aligned.

Code:
.code

avx512aligned proc
    sub rsp, 8
    xor             rax,rax
    vpbroadcastq    zmm0,rax
    mov             r8,rcx
    mov             rax,rcx
    and             rax,-64
    and             rcx,64-1
    xor             rdx,rdx
    dec             rdx   
    shl             rdx,cl
    vpcmpgtb        k1,zmm0,[rax]
    kmovq           rcx,k1
    and             rcx,rdx
    jnz             L2
 
L1:
    add             rax,64
    vpcmpgtb        k1,zmm0,[rax]
    kmovq           rcx,k1
    test            rcx,rcx
    jz              L1

L2:
    bsf             rcx,rcx
    lea             rax,[rax+rcx]
    sub             rax,r8
    dec rax
    add rsp, 8
    ret
avx512aligned endp

end

What a mess was there with the k2! Simply remove it.

guga

Re: Faster Memcopy ...
« Reply #128 on: November 18, 2019, 11:45:21 AM »
Hi, DayDreamer
"90% great job :thumbsup: some of the fail % it maybe is creator of code tried to make it harder to disassemble?"
Thanks :thumbsup: :thumbsup: :)  About the code being harder to disassemble by the creator's choice: well, not necessarily. The vast majority of the time, when the disassembler fails to identify something, it is due to characteristics of the compiler or its libraries (when the file was not packed, of course). For example, VisualBasic6 code contains a lot of embedded data inside the code section. Some Delphi or Borland files have that too. Plain C using APIs, on the other hand, is somewhat easier to disassemble, because code and data are more distinguishable from each other. Some C++ files are somewhat easier too. What makes the disassembly process a bit hard are badly encoded libraries, especially when they are made with C++, or when there is trash code inside, I mean functions that are never used. In those situations (functions that don't have any reference), following the data chain is tricky, because some of those chunks can be either code or data.

Sure, no disassembler will ever be 100% accurate, but the higher the accuracy we can get, the better. I'm pretty sure that if I finish the DIS system (similar to Flirt in Ida), I can reach something around 98% accuracy, but... no plans/time to do that right now. I still have to fix a lot of problems inside RosAsm, and try to make its inner functions less attached to the interface (and eventually rebuild it completely). I'll have to isolate the encoder, the disassembler and the debugger, and create a new resource editor; only then can I try to rewrite the interface or implement a new set of macros (or internal routines) to make it work with something closer to MASM syntax too.


Hi Steve :)
"Using rich edit, the shift from ANSI to UNICODE is simple enough to do. If you have to work in both, load the file and if it looks like garbage, switch from one to the other. Just means reloading the file again. You can make tangled messes that still can't garrantee being correct or switch between the two, the latter is much easier."
Interesting idea. :) Easier indeed.

jj2007

  • Member
  • *****
  • Posts: 10546
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #129 on: November 18, 2019, 08:58:06 PM »
"Using rich edit, the shift from ANSI to UNICODE is simple enough..."

The RichEdit control can use rtf under the hood, and thus has no problems using Unicode. In RichMasm, for example, your source can be Ansi or Unicode or a mix of both, no problem. The more interesting part is what you pass from editor to assembler - and that must be Utf-8, if you want to display non-English text in your executable.

guga

Re: Faster Memcopy ...
« Reply #130 on: November 18, 2019, 11:34:00 PM »
Hi Jj. Tks.

I tested the new update and it is working faster :)

About the UTF-8: so I always need to convert to UTF-8 when I load a file, or immediately before I show it on screen, and don't need to offer a user choice? I'm not sure I understood what you meant by passing from the editor to the assembler.
Do you open a file (Unicode or ANSI) in RichMasm and export it as UTF-8, or is the UTF-8 conversion done internally only to display the asm text on screen?

jj2007

Re: Faster Memcopy ...
« Reply #131 on: November 19, 2019, 12:35:25 AM »
This is OT but I'll keep it short:

include \masm32\MasmBasic\MasmBasic.inc         ; download
  Init
  uMsgBox 0, "Добро пожаловать", "歡迎", MB_OK
EndOfCode


- RichMasm exports this source to whatever.asm as Utf-8
- the assembler (ML, UAsm, AsmC) sees an 8-bit text but doesn't care whether that's codepage 1252 or 65001 or whatever
- the executable sees an 8-bit text, and gets told via the u in uMsgBox that a MessageBoxW is required, and that it should kindly translate the Utf-8 to Utf-16 before passing it on to MessageBoxW

That's the whole trick. For the coder, it's convenient because he can write everything directly into the source. And of course, comments can be in any encoding: the RichEdit control uses RTF to store them, and the assemblers don't care.
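
The UTF-8 to UTF-16 step described above can be sketched in portable C. This is an illustration only: how MasmBasic actually converts is not shown in this thread (on Windows, MultiByteToWideChar with CP_UTF8 is the usual way), and the function name and error convention here are invented. Validation is minimal; overlong encodings and encoded surrogates are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts a NUL-terminated UTF-8 string to NUL-terminated UTF-16.
   dst_cap is the capacity of dst in 16-bit units, terminator included.
   Returns the number of units written (terminator excluded), or
   (size_t)-1 on malformed input or if dst is too small. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL);
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;
    while (*p) {
        uint32_t cp;
        int extra;                          /* continuation bytes expected */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;             /* stray continuation byte etc. */
        p++;
        for (int i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)        /* also catches a truncated tail */
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
        }
        if (cp >= 0x10000) {                /* needs a surrogate pair */
            if (cp > 0x10FFFF || n + 2 >= dst_cap)
                return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 + (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        } else {
            if (n + 1 >= dst_cap)
                return (size_t)-1;
            dst[n++] = (uint16_t)cp;
        }
    }
    if (n >= dst_cap)
        return (size_t)-1;
    dst[n] = 0;
    return n;
}
```

For example, the UTF-8 bytes E6 AD A1 E8 BF 8E (the "歡迎" from the snippet above) become the two UTF-16 units 6B61h and 8FCEh, which is the form MessageBoxW expects.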

nidud

Re: Faster Memcopy ...
« Reply #132 on: November 19, 2019, 03:17:18 AM »
Quote
I understood the logic and corrected it in the following way and it does not need to be aligned.

Yes, you will get 64 bits there, but you need to align the input for safety.

Quote
What a mess was there with the k2! Simply remove it.

I think it also has to be set to actually work, so you may as well use it directly for the first fetch.

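    mov             r8,rcx              ; keep the original pointer for the final length
    xor             eax,eax
    vpbroadcastq    zmm0,rax            ; zmm0 = all zero bytes
    mov             rax,rcx
    and             rax,-64             ; round down to the 64-byte block holding the start
    and             ecx,64-1            ; offset of the string inside that block
    mov             rdx,-1
    shl             rdx,cl              ; clear the mask bits of the bytes before the string
    kmovq           k2,rdx
    vpcmpeqb        k1{k2},zmm0,[rax]   ; first fetch: only test bytes at or after the start
    jmp             L2
L1:
    vpcmpeqb        k1,zmm0,[rax]       ; following blocks: test all 64 bytes
L2:
    kmovq           rcx,k1              ; one bit per byte, set where the byte is zero
    add             rax,64
    test            rcx,rcx
    jz              L1                  ; no zero byte in this block, fetch the next one
    bsf             rcx,rcx             ; index of the first zero byte
    lea             rax,[rax+rcx-64]    ; address of the terminator (undoes the add above)
    sub             rax,r8              ; length = terminator - start
    ret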
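
The same scheme can be written as a scalar C model, which may help when checking the mask logic without AVX-512 hardware. This is a sketch under assumptions: the function names are invented, zero_mask64 merely emulates vpcmpeqb plus kmovq with a byte loop, and the caller must guarantee that every 64-byte aligned block touched by the string is readable. Aligned 64-byte blocks never cross a page boundary, which is why aligning the input makes the over-read safe in the assembly version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates vpcmpeqb against zero followed by kmovq:
   bit i of the result is set if block[i] == 0. */
uint64_t zero_mask64(const unsigned char *block)
{
    uint64_t mask = 0;
    for (int i = 0; i < 64; i++)
        if (block[i] == 0)
            mask |= (uint64_t)1 << i;
    return mask;
}

/* strlen that only reads whole 64-byte aligned blocks.
   Assumption: all aligned blocks from the one containing s up to the one
   containing the terminator are readable memory. */
size_t strlen_aligned64(const char *s)
{
    assert(s != NULL);
    uintptr_t addr = (uintptr_t)s;
    uintptr_t block = addr & ~(uintptr_t)63;    /* and rax,-64  */
    unsigned offset = (unsigned)(addr & 63);    /* and ecx,64-1 */

    /* First block: ignore the bytes that lie before the string. */
    uint64_t mask = zero_mask64((const unsigned char *)block)
                  & (~(uint64_t)0 << offset);
    while (mask == 0) {
        block += 64;
        mask = zero_mask64((const unsigned char *)block);
    }

    size_t bit = 0;                             /* bsf rcx,rcx */
    while (!(mask & 1)) {
        mask >>= 1;
        bit++;
    }
    return (size_t)(block + bit - addr);
}
```

Zero bytes that happen to sit in front of the string inside the first block are masked out, which is the job of k2 in the assembly.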

AW

Re: Faster Memcopy ...
« Reply #133 on: November 19, 2019, 05:37:46 AM »
It works very well.  :thumbsup:

Code:
total [0 .. 40], 8++
   307793 cycles 1.asm: PCMPISTRI
   443231 cycles 0.asm: AVX-512 aligned
   521571 cycles 2.asm: ZMM AVX512 (older)

total [41 .. 80], 7++
   257807 cycles 0.asm: AVX-512 aligned
   356038 cycles 2.asm: ZMM AVX512 (older)
   370879 cycles 1.asm: PCMPISTRI
   
total [600 .. 1000], 100++
   113553 cycles 0.asm: AVX-512 aligned
   204811 cycles 2.asm: ZMM AVX512 (older)
   859649 cycles 1.asm: PCMPISTRI

total [40000 .. 40800], 100++
   897536 cycles 0.asm: AVX-512 aligned
  2127546 cycles 2.asm: ZMM AVX512 (older)
  5980835 cycles 1.asm: PCMPISTRI 

If you convert the remaining ones to 64-bit I will test them all.

nidud

Re: Faster Memcopy ...
« Reply #134 on: November 19, 2019, 07:12:37 AM »
Added the Intel versions as well. They were originally 32-bit, so a native 64-bit version would have been coded differently from the direct translation done here.

Lots of space there now: 64 bytes * 32 regs = 2048 bytes.