Author Topic: Faster Memcopy ...  (Read 55069 times)

nidud

  • Member
  • *****
  • Posts: 2240
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #45 on: March 19, 2015, 09:43:57 AM »
The Intel's "read ahead" is not actually a real read ahead - just show us full code and it is for sure that they do verify of the finish condition (zero byte) before they read next data which may be beyond page range.

This is true. The Intel code do a byte test alignment before using DWORD's.
http://masm32.com/board/index.php?topic=1872.msg19450#msg19450

Quote
The full sense of read-ahead is only when the code does read only with not-dword-aligned strings

The same also applies to larger alignment like 8, 16, 32, ...

The problem in this case is:
- no alignment before read
- aligning using size > 1

This fails:
Code: [Select]
mov ecx,[esp+4]
mov eax,[ecx]
and ecx,-4
add ecx,4
push edx
lea edx,[eax-01010101h]
not eax
and eax,edx
and eax,80808080h
jz @F
bsf eax,eax
shr eax,3
pop edx
ret 4
@@:
mov eax,[ecx]

This works:
Code: [Select]
;
; Agner Fogh is moving the pointer back to read the first 1..4 bytes:
;
mov     ecx, pBuf    ; get pointer to string
mov     eax, ecx    ; copy pointer
and     ecx, 3    ; lower 2 bits of address, check alignment
jz     L2    ; string is aligned by 4. Go to loop
and     eax, -4    ; align pointer by 4
shl     ecx, 3    ; mul by 8 = displacement in bits
mov     edx, -1
shl     edx, cl    ; make byte mask
not     edx    ; mask = 0FFH for false bytes
or     edx, [eax]    ; read from nearest preceding boundary
; check first four bytes for zero
;-----------------------------------
lea     ecx, [edx-01010101H]   ; subtract 1 from each byte
not     edx    ; invert all bytes
and     ecx, edx    ; and these two
and     ecx, 80808080H    ; test all sign bits
jnz     L3    ; zero-byte found
; Main loop, read 4 bytes aligned
;-----------------------------------
L1: add     eax, 4    ; increment pointer by 4
L2: mov     edx, [eax]    ; read 4 bytes of string

Antariy

  • Member
  • ****
  • Posts: 557
Re: Faster Memcopy ...
« Reply #46 on: March 19, 2015, 09:45:35 AM »
Just re-read again what was written: if you have the address of the buffer - you have all required to check if the buffer is near of the end of the page. So what you will do after that - is for your decision, of course you may try to read a whole heap of data even if you see that the buffer you given with is in the last byte of the page, or you may process this case in its respective part - QWORD/DWORD/WORD/BYTE reading - depending on how close to the end of the page your buffer is.
In the case with that StrLen code which Jochen used in his lib, I just aligned the input address buffer to the 16 (and, take notice that I did also mention that in the first post here - mentioned that if you align the reading pointer to the power of two), aligned it "backwards" (just AND with -16 so it will always lie at least from 16 bytes from the end of the page in memory) and thus did first read not ahead but "backward" instead, after that just shifted the "backward" tail out of the masking reg. So, that SSE code MAY NOT FAIL with any zero-terminated string placed in any place in the memory.

jj2007

  • Member
  • *****
  • Posts: 11575
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #47 on: March 19, 2015, 11:14:51 AM »
In the case with that StrLen code which Jochen used in his lib, I just aligned the input address buffer to the 16 (and, take notice that I did also mention that in the first post here - mentioned that if you align the reading pointer to the power of two), aligned it "backwards" (just AND with -16 so it will always lie at least from 16 bytes from the end of the page in memory) and thus did first read not ahead but "backward" instead, after that just shifted the "backward" tail out of the masking reg. So, that SSE code MAY NOT FAIL with any zero-terminated string placed in any place in the memory.

This is correct, but to be fair, the example of the 4096 byte buffer posted below works only with SEH:

include \masm32\MasmBasic\MasmBasic.inc      ; download
  Init tc                  ; install the SEH handler

  FileWrite "test.txt", String$(4096, "x")      ; prepare a nasty example
  xchg rv(filesize, Chr$("test.txt")), ebx
  xchg rv(VirtualAlloc, 0, ebx, MEM_COMMIT, PAGE_READWRITE), edi
  push rv(_lopen, Chr$("test.txt"), OF_READ)
  invoke _lread, eax, edi, ebx      ; exactly 4096 bytes in a VirtualAlloc'ed buffer
  call _lclose
  Print Str$("%i bytes read from file\n", ebx)      ; this file is exactly one page long
  Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi))      ; Len() WORKS OK

  PrintLine "Now trying crt_strlen:"      ; once you see this, Ctrl C is your only change...
  invoke crt_strlen, edi                  ; ... because the CRT fails miserably, it just hangs
  Inkey Str$("crt_strlen says the len of string is %i bytes\n", eax)      ; you will not see this message
  Exit
  TryCatchEnd            ; activate the handler - must be last instruction before "end start"
end start

Yes it does return 4096 bytes, while the CRT just hangs, but... we must go at least one byte beyond the limits, otherwise how would the algo know that byte 4096+1 is a zero byte...?

Still, Len() remains usable, while the CRT just hangs. You can try inserting an exception handler also for the CRT:
  Try
  invoke crt_strlen, edi      ; great, it doesn't hang any more...!
  Catch

That works in the sense that your app is no longer blocked, but look at the result :P

Interestingly enough, Masm32 len() fails silently:
  Try
  Print Str$("Masm32 len() says the length of the string is %i bytes\n", len(edi))
  Catch


And of course, all these examples work correctly if you replace the 4096 with 4095.

nidud

  • Member
  • *****
  • Posts: 2240
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #48 on: March 19, 2015, 09:43:31 PM »
And of course, all these examples work correctly if you replace the 4096 with 4095.

Are you sure?
Code: [Select]
mov byte ptr [edi + 4096 – 1],0
add edi,15

Well, this version should read 32 byte equally safe as 1.
Code: [Select]
OPTION PROLOGUE:NONE, EPILOGUE:NONE

strlen  proc string
mov ecx,[esp+4]
xorps xmm2,xmm2
mov eax,ecx
and ecx,32-1
and eax,-32
movdqa  xmm0,[eax]
movdqa  xmm1,[eax+16]
or edx,-1
shl edx,cl
pcmpeqb xmm0,xmm2
pmovmskb ecx,xmm0
and ecx,edx
jnz done
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1
shl ecx,16
and ecx,edx
jnz done
loop_32:
lea eax,[eax+32]
movdqa  xmm0,[eax]
movdqa  xmm1,[eax+16]
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb ecx,xmm1
shl ecx,16
or ecx,edx
jz loop_32
done:
bsf ecx,ecx
lea eax,[ecx+eax]
sub eax,[esp+4]
ret 4
strlen  endp
« Last Edit: November 17, 2019, 03:04:29 AM by nidud »

Antariy

  • Member
  • ****
  • Posts: 557
Re: Faster Memcopy ...
« Reply #49 on: March 19, 2015, 10:30:34 PM »
Once again, this algo will not fail with ANY zero-terminated string located anywhere in the memory:

Code: [Select]
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
;mov edx,[esp+4]
;mov eax,[esp+4]
db 8bh,94h,24h,04,0,0,0
db 8bh,84h,24h,04,0,0,0
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
;and edx,-10h
db 81h,0e2h,0F0h,0FFh,0FFh,0FFh
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
jz @F
pcmpeqb xmm7,[edx]
add edx,10h
pmovmskb eax,xmm7
shr eax,cl
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16
pcmpeqb xmm7,[edx]
add edx,16
pmovmskb eax,xmm7
test eax,eax
jz @B

bsf eax,eax
sub edx,[esp+4+16+4]
lea eax,[eax+edx-10h]

@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Jochen,
your "example" is WRONG. You're passing NOT PROPER string to the algo - and you should not wait for proper results. Your string is exactly unterminated - it's an infinite size from the point of view of the algorithm logic, and the results you are returning (4096) - ARE WRONG, take that or not - that is for your decision. The algo should fail there (crash) or it should return the flag which OBVIOUSLY marks that the string is incorrectly formatted - for an instance you should return -1 as mark of the string ininity. Otherwice using your StrLen you will never know that the string is not proper string - preperly terminated - with the real size of 4096 bytes - and your code will use that improper string further with full assuming that it's proper, but it is WRONG "ZERO TERMINATED" STRING. And any other components of the program yours or not yours (for example Windows), will crash or misuse that improper string, for an instance long after you did "checked" it, so after that you will long search why it crashes - because at the point where you were required to flag the improper string, you just silently returned "the fine" and "seems ok" result.
The code should never "silently substitute" the breaks in the program logic (improper zero terminated string is that stuff) with the "result which seems OK" - but your strlen does exactly this.

nidud,
your code has still the same problem - you for some reason check only first 32 bytes of the string, but the rest you're assuming safe to do two reads and only one check - the same bug there. Your code now is partially "safe" only with the strings with size smaller that 32 bytes - any other string will crash your code if it is placed in such a manner that it's zero terminator is closer to the page end than 32 bytes - i.e. if string terminated at 1...31 bytes from the page end.
 
The first my post here, written in almost proper english and very descriptive has all the info to implement fully safe algo without any assumptions or silent substitutions/hiding of errors. Sometimes I agree with the topic paster on the thought that the more descriptive and full information one tries to provide - the less respect and carefull attention to the one's descriptions/writings one haves.

nidud

  • Member
  • *****
  • Posts: 2240
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #50 on: March 20, 2015, 12:11:59 AM »
nidud,
your code has still the same problem - you for some reason check only first 32 bytes of the string, but the rest you're assuming safe to do two reads and only one check - the same bug there. Your code now is partially "safe" only with the strings with size smaller that 32 bytes - any other string will crash your code if it is placed in such a manner that it's zero terminator is closer to the page end than 32 bytes - i.e. if string terminated at 1...31 bytes from the page end.

The code won’t fail   :P

To put your mind at ease, this test the last 32 bytes
Code: [Select]
    .if VirtualAlloc(0,4096,MEM_COMMIT,PAGE_READWRITE)
mov ebx,eax
mov edi,eax
mov ecx,4096
mov eax,'x'
rep stosb
mov edi,4096-15
add ebx,15
.repeat
dec edi
mov byte ptr [ebx+edi],0
push ebx
call esi
mov esp,ebp
.until edi == 4096 - 33 - 15
sub ebx,15
invoke  VirtualFree,ebx,0,MEM_RELEASE
    .endif

The timings for the new algo:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
-- string length 0..40
844627    cycles - (  0) proc_0: crt_strlen
1120048   cycles - (  0) proc_1: len()
832447    cycles - (  0) proc_2: Len()
497990    cycles - (104) proc_5: SSE 32 - safe
620398    cycles - (104) proc_6: AxStrLenSSE
-- string length 40..80
937328    cycles - (  0) proc_0: crt_strlen
1108642   cycles - (  0) proc_1: len()
542569    cycles - (  0) proc_2: Len()
340045    cycles - (104) proc_5: SSE 32 - safe
474865    cycles - (104) proc_6: AxStrLenSSE
-- string length 600..1000
510733    cycles - (  0) proc_0: crt_strlen
761988    cycles - (  0) proc_1: len()
171665    cycles - (  0) proc_2: Len()
111466    cycles - (104) proc_5: SSE 32 - safe
162919    cycles - (104) proc_6: AxStrLenSSE

result:
   949501 cycles - (104) proc_5: SSE 32 - safe
  1258182 cycles - (104) proc_6: AxStrLenSSE
  1546681 cycles - (  0) proc_2: Len()
  2292688 cycles - (  0) proc_0: crt_strlen
  2990678 cycles - (  0) proc_1: len()

jj2007

  • Member
  • *****
  • Posts: 11575
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #51 on: March 20, 2015, 01:47:04 AM »
Jochen,
your "example" is WRONG. You're passing NOT PROPER string to the algo - and you should not wait for proper results.

Alex,

My example is certainly problematic, and I agree that such an "improperly delimited" string should not be passed to a Windows API that expects a zero-delimited string; which, for example, is not the case for crt_lcpyn:

  Let esi=Space$(4097)
  invoke lstrcpyn, esi, edi, Len(edi)
  Print "[", esi, "]"


OTOH, the string is perfectly valid in a MasmBasic macro such as Let or Print (which only want to know how many bytes must be copied):

  Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi))      ; Len() WORKS OK
  Let esi="The string: ["+edi+"]"
  PrintLine esi
  PrintLine "[", edi, "]"


Nonetheless, I do not encourage SEH (that is what Init tc does) to enable problematic examples like this.

Antariy

  • Member
  • ****
  • Posts: 557
Re: Faster Memcopy ...
« Reply #52 on: March 20, 2015, 06:01:37 AM »
The code won’t fail   :P

To put your mind at ease, this test the last 32 bytes


You may leave your sarcasm for yourself :P

Yes, it will not fail, because you did aligned data fetch to 32 bytes - but when I was first replying to your code sample, you provided only the inner loop logic, without other code, which shows that you do align to 32. The second time I took attention on the way you do first checking but did not noticed that you align the buffer fetch to 32, so that's why I said that it will fail - it should have fail if it will fetch data with 32 bytes but if the buffer fetch fill be aligned to only 16 bytes, but, as already said, I just did not noticed the proper alignment when replied second time, and there was not full code the first time from your side.
But, again, nothing wrong with that I was saying that in my first post in the thread I was describing the proper way of implementation of the algo, there I was written for an instance this sentence
Quote
It is simpler to align the pointers to the power of two - so you may freely read and assume that the code will not crash - and it will not crash if you will check every read for match to the algos purpose.

You just did it right (aligned to the power of two, equal to the size of data fetch, the same as I was writing, too - that you should align and use proper data size - QWORD/DWORD/WORD/BYTE etc - you just did that with "DXMMWORD" (double XMM word) - so nothing wrong with my explanation).

The source you provided is not recompilable, probably set of your custom macroses is missed there, so I did not run the test itself, but for sure it will work.
Just a thing: which code did you used in the test? Or you do load of the code from the .bin files? Just haven't found for an instance your code which you posted earlier there (which aligns to 32) in the disassembly listing ::)

Antariy

  • Member
  • ****
  • Posts: 557
Re: Faster Memcopy ...
« Reply #53 on: March 20, 2015, 06:45:08 AM »
Jochen,

My example is certainly problematic, and I agree that such an "improperly delimited" string should not be passed to a Windows API that expects a zero-delimited string; which, for example, is not the case for crt_lcpyn:

  Let esi=Space$(4097)
  invoke lstrcpyn, esi, edi, Len(edi)
  Print "[", esi, "]"


Well, lstrcpyn is more like the memcpy function with one additional stop factor - zero byte, because it has the length specificator. So it's not really well valid example. More over for some reason the lstrXXXn functions break the zero-terminated logic itself: these functions break the string in some unexpected place, so, for an instance, in some cases you might receive the partial string, but this may not look like proper result, as you of course understand. These functions act more like those "hiders" of error - the code actually copied just a part of the string but the proggie continues to use that string thinking that it's proper, full string with full information it must hold. Of course, the truncated string contains not full info, so it may be treated just as a trash, not the data. IF the string was just some message etc - this may be "acceptable" to receive just a truncated message on the display, but if the original string which was copied to the smaller buffer and got truncated was containing some important data (file name, just a file system path, name of some object, an encoded (for an instance a short base64 message) form of some other data etc etc) - then this case is obviously unacceptable because you get not the data, but a trash.
In your example you might substitute the lstrcpyn with memcopy+zero termination for an instance, since the work of handling the exception and determining the "length" of the string anyway was done by MasmBasic.
But that did not change things, actually: your code got an improper string, the code does not know for which reason the string was improper - it was not terminated and continued until the end of the page. The code should flag this, instead of just continuing to work as if the string is properly terminated. The rule of programming is that it's better to have do not working program, that working improperly. So it's better for code to detect that it got improper zero terminated string and flag it in some way (to the user or just going other way/return etc). Of course, this applicable to the code which expects a ZERO TERMINATED strings - if the algo which should receive the zero-terminated strings receives not terminated string, it should crash or flag it. But the algos which do not require zero-terminated strings to work, but rely on the "buffer address" and "known size" params - may work with no difference if there is zero terminator, but just because they by design do not require zero terminated strings, but rather binary strings.
Your MB Len algo is proper for the MB strings, but in a case you pass a string which is known to be zero terminated - you should not accept silently those claimed as "zero terminated" but not terminated actually strings, that was the point I tried to describe.

Anyway the problem of taking the unproper zero terminated strings with the algos which do not flag that and reject further processing is that if you continue, you might continue to receive more and more such strings, but SEH is very greedy in therms of timings, so any time you receive such a string you have thousands of cycles stalls just to continue work after a SEH catch.
And, of course, if you will pas that string which is not actually terminated to any other code - it may crash there. Other way you may do - copy that improper string to proper MB string and pass that string to the algos, but, again, every improper string copy will cost thousands of cycles.

OTOH, the string is perfectly valid in a MasmBasic macro such as Let or Print (which only want to know how many bytes must be copied):

  Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi))      ; Len() WORKS OK
  Let esi="The string: ["+edi+"]"
  PrintLine esi
  PrintLine "[", edi, "]"


Nonetheless, I do not encourage SEH (that is what Init tc does) to enable problematic examples like this.

BTW, what means OTOH acronym?

Yes, I did not tell you that the MASMBASIC strings are wrong if they are not terminated with zero, I told about ZERO TERMINATED strings. If MB algo receives zero terminated string which is not terminated - then it should obviously report you in some way that the string is not terminated... for an instance you may return the "seem ok" string length in EAX, but -1 in EDX if the string was not terminated and zero in EDX if it was proper zero terminated string.

The installed SEH covers all your algos in the library? I.e. the Let, the Len algos are not crash with that improper zero-terminated string only if you install the SEH at the Init? If you will not - the both Len and Let will crash with that string? How do you implemented SEH coverage on your lib - as an UnhandledExceptionFilter or as a per-thread SEH frames (just like in my signature)? The UnhandledExceptionFilter does not provide a way to handle exceptions in multiple-threaded applications, because the "safe address" which you may use to restore execution might be overwritten in the other thread which enters some function which set that "safe address" up with its own address of restoration.

nidud

  • Member
  • *****
  • Posts: 2240
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #54 on: March 20, 2015, 07:17:22 AM »
You may leave your sarcasm for yourself :P

 :biggrin:

Quote
Yes, it will not fail, because you did aligned data fetch to 32 bytes - but when I was first replying to your code sample, you provided only the inner loop logic, without other code, which shows that you do align to 32.

That code did actually fail. All these versions are included in the strlen.zip file (#50). I converted a string library some time ago and the code that read-ahead is used in many of these functions.

Quote
The source you provided is not recompilable, probably set of your custom macroses is missed there, so I did not run the test itself, but for sure it will work. Just a thing: which code did you used in the test? Or you do load of the code from the .bin files?

I uploaded the test-bed (benchmarks.zip) used here (#34).

Antariy

  • Member
  • ****
  • Posts: 557
Re: Faster Memcopy ...
« Reply #55 on: March 20, 2015, 07:31:40 AM »
You may leave your sarcasm for yourself :P

 :biggrin:

:t

Quote
Yes, it will not fail, because you did aligned data fetch to 32 bytes - but when I was first replying to your code sample, you provided only the inner loop logic, without other code, which shows that you do align to 32.

That code did actually fail. All these versions are included in the strlen.zip file (#50). I converted a string library some time ago and the code that read-ahead is used in many of these functions.

Ah, so at least one of the pieces discusset failed. You mean this algo, probably?
Code: [Select]
.686
.model flat, stdcall
.code
.xmm

OPTION PROLOGUE:NONE, EPILOGUE:NONE

strlen  proc string
mov ecx,[esp+4] ; 4
xorps xmm2,xmm2 ; 3
push edx ; 1
@@:
movdqu  xmm1,[ecx+16]
movdqu  xmm0,[ecx]
add ecx,32
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb eax,xmm1
shl eax,16
or eax,edx
jz @B
pop edx
bsf eax,eax
lea eax,[ecx+eax-32]
sub eax,[esp+4]
ret 4
strlen  endp

END

It's in 4.asm, and the only algo which reads unaligned so it crashed near end of the page even if it did reads by 16 byte chunks, but it more over does two reads of 32 bytes. So first time you posted two-reading piece of this code?

Quote
Quote
The source you provided is not recompilable, probably set of your custom macroses is missed there, so I did not run the test itself, but for sure it will work. Just a thing: which code did you used in the test? Or you do load of the code from the .bin files?

I uploaded the test-bed (benchmarks.zip) used here (#34).

Ah, got it earlier but did not got that it was full set of the files. the dz.exe is your file manager DosZip?
But, the other question: do you load algos from binary files? Just did not found your safe 32 bytes algo in the listing of the timeit.exe.

nidud

  • Member
  • *****
  • Posts: 2240
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #56 on: March 20, 2015, 08:22:02 AM »
Ah, so at least one of the pieces discusset failed. You mean this algo, probably?

Yes, there are many like this around, and here are some more  :lol:

Quote
It's in 4.asm, and the only algo which reads unaligned so it crashed near end of the page even if it did reads by 16 byte chunks, but it more over does two reads of 32 bytes.

The same also with 4 and 8.

Quote
So first time you posted two-reading piece of this code?
Think I posted it somewhere but I'm not sure.

Quote
Ah, got it earlier but did not got that it was full set of the files. the dz.exe is your file manager DosZip?
Yes, plus assembler and make to simplify the testing. If on the same drive as masm32 it should work out of the box.

Quote
But, the other question: do you load algos from binary files? Just did not found your safe 32 bytes algo in the listing of the timeit.exe.

Yes, for reasons explained here.
« Last Edit: November 17, 2019, 03:06:01 AM by nidud »

jj2007

  • Member
  • *****
  • Posts: 11575
  • Assembler is fun ;-)
    • MasmBasic
Re: Faster Memcopy ...
« Reply #57 on: March 20, 2015, 02:14:53 PM »
BTW, what means OTOH acronym?
on the other hand

Quote
The installed SEH covers all your algos in the library? I.e. the Let, the Len algos are not crash with that improper zero-terminated string only if you install the SEH at the Init?

If SEH is installed (and I rarely use that feature), then there is exactly one point in the library where it is being used: MbStrLen. Which is a function that is used internally very often, of course.

rrr314159

  • Member
  • *****
  • Posts: 1382
Re: Faster Memcopy ...
« Reply #58 on: March 20, 2015, 09:15:14 PM »
Hello Antariy, nice to meet you,

There is no difference in the "MEM" or "STR" functions - just imagine your "MEM" function received wrong bytecount to scan/copy/etc - and if the buffer has actually the size smaller than it was specified with bytecount param, and the buffer is lying near to the end of the memory page - the "MEM" code will crash as well as the "STR" code.

- If the user makes an incorrect request (wrong bytecount) then perhaps the algo should just crash. If he's careless enough to do that he probably won't check error code return, so how do you tell him he made a mistake? If you cover for him, it will probably nail him later somewhere else. Perhaps, let it crash so he knows the error. That's how my memcopy behaves.

But, that's a matter of opinion. The important point is, if user gives correct bytecount, (or, correctly terminated C-style zero-STR), then you must not crash by reading / writing outside the requested buffers. Even though reading 128 bytes at a time, my memcopy never goes beyond user-specified bounds.

It is simpler to align the pointers to the power of two...

- You mean, align to the chunk size, assuming it's a power of 2. (The "size of the data grabbing" is called "chunk".)

The loud cries about "the algo is faster instead" is not excusable - the algo should be reliable first, and then it MAY be thought of making it faster.

- Of course! Surely Lingo agrees?

Sometimes I agree with the topic paster on the thought that the more descriptive and full information one tries to provide - the less respect and carefull attention to the one's descriptions/writings one haves.

- Very annoying, go to a lot of trouble with the language and then no one reads it! But, minor point: actually, I didn't paste the topic. Couldn't find anywhere on the net to copy it from, so I had to write it myself :P

Once again, this algo will not fail with ANY zero-terminated string located anywhere in the memory: ...

- Right, but it can be 18 bytes shorter, and a little cleaner:

Code: [Select]
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
mov edx,[esp+4]
; mov eax,[esp+4]                   ; don't get eax from [esp+4]
        mov eax, edx                     ; but from edx, save 2 bytes
; db 8bh,94h,24h,04,0,0,0           ; don't pad 6 bytes with
; db 8bh,84h,24h,04,0,0,0           ; these two db instructions
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
and edx,-10h                      ; don't pad 3 bytes
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
; jz @F                             ; no jump save 2 bytes
pcmpeqb xmm7,[edx]
; add edx,10h                       ; don't add here save 3 bytes
pmovmskb eax,xmm7
shr eax,cl                         ; works fine if cl is 0
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16                ; still aligned but 16 bytes less
add edx,16                         ; switch order of these
pcmpeqb xmm7,[edx]                 ; two instructions
pmovmskb eax,xmm7
test eax,eax
jz @B

bsf eax,eax
sub edx,[esp+4+16+4]
; lea eax,[eax+edx-10h]
        add eax, edx                    ;  don't need -10h any more, save 2 bytes

@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Here it is without all the extra comments:

Code: [Select]
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings

    mov edx,[esp+4]
    mov eax, edx
    add esp,-14h
    movups [esp],xmm7
    mov [esp+16],ecx
    and edx,-10h
    mov ecx,eax
    pxor xmm7,xmm7
    and ecx,0fh
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    shr eax,cl
    bsf eax,eax
    jnz @ret
   
    pxor xmm7,xmm7
    @@:                                ; naturally aligned to 16
        add edx,16
        pcmpeqb xmm7,[edx]
        pmovmskb eax,xmm7
        test eax,eax
        jz @B
   
    bsf eax,eax
    sub edx,[esp+4+16+4]
    add eax, edx
   
    @ret:
    movups xmm7,[esp]
    mov ecx,[esp+16]
    add esp,14h
    ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

- BTW your English is very understandable, much better than I write your language! c u later
I am NaN ;)

Antariy

  • Member
  • ****
  • Posts: 557
Re: Faster Memcopy ...
« Reply #59 on: March 21, 2015, 11:08:43 PM »
BTW, what means OTOH acronym?
on the other hand

Quote
The installed SEH covers all your algos in the library? I.e. the Let, the Len algos are not crash with that improper zero-terminated string only if you install the SEH at the Init?

If SEH is installed (and I rarely use that feature), then there is exactly one point in the library where it is being used: MbStrLen. Which is a function that is used internally very often, of course.

How did you implement the exception handling - as per-thread SEH or SetUnhandledExceptionFilter? IIRC it was SUEF?




Hello Antariy, nice to meet you,

Hello rrr, nice to meet you, too.

Hello Antariy, nice to meet you,

There is no difference in the "MEM" or "STR" functions - just imagine your "MEM" function received wrong bytecount to scan/copy/etc - and if the buffer has actually the size smaller than it was specified with bytecount param, and the buffer is lying near to the end of the memory page - the "MEM" code will crash as well as the "STR" code.

- If the user makes an incorrect request (wrong bytecount) then perhaps the algo should just crash. If he's careless enough to do that he probably won't check error code return, so how do you tell him he made a mistake? If you cover for him, it will probably nail him later somewhere else. Perhaps, let it crash so he knows the error. That's how my memcopy behaves.

But, that's a matter of opinion. The important point is, if user gives correct bytecount, (or, correctly terminated C-style zero-STR), then you must not crash by reading / writing outside the requested buffers. Even though reading 128 bytes at a time, my memcopy never goes beyond user-specified bounds.

The algo may not know if the user is careless or not - it is "just code" ("just a machine") and has no rights to make assumptions about those who use it. Imagine the same with, for an instance, with some protection scheme: the user sits before the computer, turning it on, the computer says "I diskike your face today, probably you're drunk or you're not the owner of this machine, maybe you're a criminal, let's format harddrive of this machine to protect the real owner from data steal".

The fact that the "ideology" of "protected programming" was perversed by many contemporal programmers who are careless and simultaneously with carelessness use the "protected programming" techniques, do not make the "ideology" itself wrong. The code is better to behave in predictable, fully documented and robust way - and the rest is for the user's decision. The routines are just tools, aren't they? One wants to use as much as possible useful, comfortable and robust tool, so, if there is such tool - it's good. But HOW the tool will be used and which product it will produce is for the "master" who uses this tool.

So, probably nothing is wrong with "protected programming" nor "SEH" etc. - this is problem of people psychic rather than the "ideology" - the careless and sloppy coders produce problems, not the code itself.

The code may not know if the user provided a wrong bytecount or the buffer for some reason was not such a size as the user thinks (this may really occur - for an instance with memory-mapped files / sections while the data was not loaded for some reason (disk unexpectedly became full or just hardware failure) into the memory page) but the bytecount was right, so, if the user was careless enough to not notice that the system did not provide the full data and passed the pointer to the code, it may have one more opportunity to know that there is a error, especially if the error-flagging return differes from the regular return very noticeable. Also the itself system bugs may not be discarted from the count - even if the system provided wrong result, the user still might notice it with the code which reports about it.
After all, yes, the "protected programming" implies that one should rely on the system as a "standard" which has no bugs, and the coding style is for decision of the programmer. There are, as usual, two biggest camps: those, who codes absolutely sloppy, and those who codes paranoidally "carefull" - both camps produce usually the most buggy products on the planet and usually are the most big and respective companies on the planet :greensml:



It is simpler to align the pointers to the power of two...

- You mean, align to the chunk size, assuming it's a power of 2. (The "size of the data grabbing" is called "chunk".)

Yes, this correction is right and describes what I wanted to say more clearly.

The loud cries about "the algo is faster instead" is not excusable - the algo should be reliable first, and then it MAY be thought of making it faster.

- Of course! Surely Lingo agrees?

I do not know with which things does he agree and more overy do not want to know that :greensml: But from his "coding style" many of our members are probably got an opinion that he is a megalomaniacal self-named "God of assembly" and a "Code Nazy" who first on the planet invented every thing and all other are stolen that from him. But he is also known to produce buggy but "fast working" code and when he gets the notes on that he just goes to the insulting others, so, I think, it's better to ask his psychiatrist - what he agrees with :greensml:

Sometimes I agree with the topic paster on the thought that the more descriptive and full information one tries to provide - the less respect and carefull attention to the one's descriptions/writings one haves.

- Very annoying, go to a lot of trouble with the language and then no one reads it! But, minor point: actually, I didn't paste the topic. Couldn't find anywhere on the net to copy it from, so I had to write it myself :P

Well, I think, if the jokes apart, you understand that did not mean "copy"-"paster". You "pasted" the topic on the forum - there was no such topic, but it appears - it was "pasted" there, so, the one who did it was topic-"paster".
The word "paste" is independed word from the "copy" or from the "therm" "copy-paste"? So, then, that was pretty right said :P

Once again, this algo will not fail with ANY zero-terminated string located anywhere in the memory: ...

- Right, but it can be 18 bytes shorter, and a little cleaner:

How it was written has reasons, for an instance the XMM and ECX preservation was the "condition of production" - Jochen's condition since his lib preserves those regs in its internal functions - it's MasmBasic's ABI.
Other things like "wide" instructions was used to padd the inner loop to align it by 16, yes, but here it was for reason, too, because it was to fit the code as it was written - with all the specifics. The code may be shorter, yes, but as it was written was seemed (by me, of course, so this may be called as a "subjective judge") it was most performant on a widest variety of hardware. Some our members may confirm that I like to "hunt for bytes" like a maniac sometimes :greensml:, but in this case it was not a case (the pun).

I will just comment the changed code in the lines below of the changes - without indent, why it was written so.

Code: [Select]
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
mov edx,[esp+4]
; mov eax,[esp+4]                   ; don't get eax from [esp+4]
        mov eax, edx                     ; but from edx, save 2 bytes
; db 8bh,94h,24h,04,0,0,0           ; don't pad 6 bytes with
; db 8bh,84h,24h,04,0,0,0           ; these two db instructions

The code which was original will work faster because of reference to the same memory location, which,
being loaded to the cache (if it was not yet) when first accessed, is the "fast to do" thing, rather
than dependent and thus stalling instructions when the value taken into one reg and then copied to the
second reg from first one.

If you want to save those two bytes and leave these instructions as they were coded, then just replace
pxor instructions with the xorps.

add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx

Actually if I will have to descrease the size of the algo I will probably remove the XMM and ECX presevation,
because it's not APIs ABI but only MasmBasic ABI.

and edx,-10h                      ; don't pad 3 bytes
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction

This was coded so to put loop to 16 bytes align, why the code was in a whole coded so may be understandable
only if it is seeing as a whole but not some different pieces - read below why it was decided as unappropriate
to write the code so as it was changed in this example.

mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
; jz @F                             ; no jump save 2 bytes

The CPU "predicts" the "forward" jumps as "probably untaken", so, the jump will not be taken with the
possibility 16:1 (unaligned string) with byte-aligned strings, or 8:1 with even aligned strings, or 4:1 with
dword aligned strings, so, the jump and the some slowdown is more likely to not happen. And even if it happens
the slowdown itself is not big (the high numbers of jumps are exaggerated really, and as the jump is near and in the same code
location actually, it maybe assumed that the jump will not lead to any "misprediction" stalls since the code
is already fetched and decoded, even that code where the jump jumps (the pun again) to). So, it will cost only few cycles,
but taking in account that it over-jumps some instructions, which will take some time, this jump in the real world
(not millions of iterations on the same data as it is doing with our testbeds) will not only be a slowdown but will
rather speed the code up.

pcmpeqb xmm7,[edx]
; add edx,10h                       ; don't add here save 3 bytes

It will be described below why the pointer advancing was done after the data fetch.

pmovmskb eax,xmm7
shr eax,cl                         ; works fine if cl is 0

The jump was done not to "protect" from zero ECX but to not execute unrequired and slowing down code without reason.

bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16                ; still aligned but 16 bytes less
add edx,16                         ; switch order of these
pcmpeqb xmm7,[edx]                 ; two instructions

The pointer advancing BEFORE data fetching slows down the code, especially on earlier than modern CPUs.
The mov eax,[edx]; add edx,16 will be generally always faster than add edx,16; mov eax,[edx]
because these two instructions in first case pair (memory access goes into one block and the arithmetic operation
goes into second block) and not pair in second case.

pmovmskb eax,xmm7
test eax,eax
jz @B

bsf eax,eax
sub edx,[esp+4+16+4]
; lea eax,[eax+edx-10h]
        add eax, edx                    ;  don't need -10h any more, save 2 bytes
       
It was needed because the pointer advancing was specially chosen to be done after data fetching.

@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

- BTW your English is very understandable, much better than I write your language! c u later

Well, it was not so some years ago, but some forum members also did assure me that it is pretty understandable, even being pretty "Cyrillic English" as one of the forum friends called it.
Do you write Russian?



Ah, so at least one of the pieces discusset failed. You mean this algo, probably?

Yes, there are many like this around, and here are some more  :lol:

You did point to that size for some reason (maybe it was some useful on that page? I did not read it thorougly), or just as an example of the fact that there are many implementations which do not care about possibility of the crash (do not align the fetch, and even grab the data unaligned all the time)?


Quote
It's in 4.asm, and the only algo which reads unaligned so it crashed near end of the page even if it did reads by 16 byte chunks, but it more over does two reads of 32 bytes.

The same also with 4 and 8.

Yes, but I told about SSE one and just did not mention other.


Quote
So first time you posted two-reading piece of this code?
Think I posted it somewhere but I'm not shore.

Just on the previous page there was a piece which included the only inner loop of the two-reading and one-checking algo. My first post here fast after that where I first said that the problem of the algo is in two reads and one check - since you provided only inner loop and not full algo first time, I assumed you do the reads unaligned/aligned to only 16.



Quote
Ah, got it earlier but did not got that it was full set of the files. the dz.exe is your file manager DosZip?
Yes, plus assembler and make to simplify the testing. If on the same drive as masm32 it should work out of the box.

OK, good. Assembler is JWasm?


Quote
But, the other question: do you load algos from binary files? Just did not found your safe 32 bytes algo in the listing of the timeit.exe.

Yes, for reasons explained here.

Ah, well known code location problem. But you might try to implement the solution simpler - with definition of different segment for every tested algo. This will allow to run the not "code location independed" algos as well, with no need to relocate them in some way manually. The algos with jump/call tables are those "relocation required" algos, for an instance. Or the algos which do not-near relative/direct calls/jumps.