BTW, what does the acronym OTOH mean?
on the other hand
Does the installed SEH cover all the algos in your library? I.e., do the Let and Len algos avoid crashing on an improperly zero-terminated string only if you install the SEH at Init?
If SEH is installed (and I rarely use that feature), then there is exactly one point in the library where it is being used: MbStrLen. Which is a function that is used internally very often, of course.
How did you implement the exception handling - as per-thread SEH or SetUnhandledExceptionFilter? IIRC it was SUEF?
Hello Antariy, nice to meet you,
Hello rrr, nice to meet you, too.
There is no difference between the "MEM" and "STR" functions here. Just imagine your "MEM" function received a wrong bytecount to scan/copy/etc.: if the buffer is actually smaller than the bytecount param says, and it lies near the end of a memory page, the "MEM" code will crash just like the "STR" code.
- If the user makes an incorrect request (wrong bytecount) then perhaps the algo should just crash. If he's careless enough to do that he probably won't check error code return, so how do you tell him he made a mistake? If you cover for him, it will probably nail him later somewhere else. Perhaps, let it crash so he knows the error. That's how my memcopy behaves.
But, that's a matter of opinion. The important point is, if user gives correct bytecount, (or, correctly terminated C-style zero-STR), then you must not crash by reading / writing outside the requested buffers. Even though reading 128 bytes at a time, my memcopy never goes beyond user-specified bounds.
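The page-end argument can be checked with a little arithmetic. Below is a minimal sketch in C (not code from the thread), assuming 4 KiB pages; the names SKETCH_PAGE_SIZE and read_crosses_page are made up for illustration. It only tests whether a read of a given size starting at a given address touches a second page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, the usual size on 32-bit x86 Windows. */
#define SKETCH_PAGE_SIZE 4096u

/* Returns 1 if reading `size` bytes starting at `addr` touches more than
   one page, 0 if the whole read stays inside a single page. */
static int read_crosses_page(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= SKETCH_PAGE_SIZE);
    uintptr_t first_page = addr / SKETCH_PAGE_SIZE;
    uintptr_t last_page  = (addr + size - 1) / SKETCH_PAGE_SIZE;
    return first_page != last_page;
}
```

A 16-byte read at 0x1FF0 (aligned to 16) ends exactly at the page end, while the same read at 0x1FF8 spills into the next page, which may not be mapped. Since the page size is a multiple of the chunk size, a chunk-aligned read can never straddle two pages.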
The algo cannot know whether the user is careless or not - it is "just code" ("just a machine") and has no right to make assumptions about those who use it. Imagine the same with, for instance, some protection scheme: the user sits down at the computer and turns it on, and the computer says "I dislike your face today, probably you're drunk or you're not the owner of this machine, maybe you're a criminal, let's format the hard drive of this machine to protect the real owner from data theft".
The fact that the "ideology" of "protected programming" has been perverted by many contemporary programmers, who are careless and at the same time use "protected programming" techniques, does not make the "ideology" itself wrong. It is better for code to behave in a predictable, fully documented and robust way - the rest is the user's decision. The routines are just tools, aren't they? One wants to use the most useful, comfortable and robust tool possible, so if there is such a tool, that's good. But HOW the tool will be used, and which product it will produce, is up to the "master" who uses it.
So probably nothing is wrong with "protected programming" or "SEH" etc. - this is a problem of people's psychology rather than of the "ideology": careless and sloppy coders produce the problems, not the code itself.
The code cannot know whether the user provided a wrong bytecount or whether the buffer, for some reason, is not the size the user thinks it is. This may really occur, for instance with memory-mapped files / sections when the data was not loaded into the memory page for some reason (the disk unexpectedly became full, or just a hardware failure) although the bytecount was right. So if the user was careless enough not to notice that the system did not provide the full data and passed the pointer to the code, he gets one more opportunity to learn that there is an error, especially if the error-flagging return differs very noticeably from the regular return. System bugs themselves cannot be discounted either: even if the system provided a wrong result, the user still might notice it thanks to code which reports it.
After all, yes, "protected programming" implies that one should rely on the system as a "standard" which has no bugs, and the coding style is the programmer's decision. There are, as usual, two big camps: those who code absolutely sloppily, and those who code paranoidally "carefully". Both camps usually produce the most buggy products on the planet, and usually are the biggest and most respected companies on the planet.

It is simpler to align the pointers to the power of two...
- You mean, align to the chunk size, assuming it's a power of 2. (The "size of the data grabbing" is called "chunk".)
Yes, this correction is right and describes what I wanted to say more clearly.
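For a power-of-two chunk size, aligning to the chunk is a single AND with a mask, which is what `and edx,-10h` and `and ecx,0fh` do in the listing further down. A minimal sketch in C; the helper names align_down and misalignment are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds `addr` down to a multiple of `chunk`.
   `chunk` must be a power of two (16 for one XMM register). */
static uintptr_t align_down(uintptr_t addr, uintptr_t chunk)
{
    assert(chunk != 0 && (chunk & (chunk - 1)) == 0);
    return addr & ~(chunk - 1);
}

/* Number of bytes by which `addr` lies past the previous chunk boundary. */
static unsigned misalignment(uintptr_t addr, uintptr_t chunk)
{
    assert(chunk != 0 && (chunk & (chunk - 1)) == 0);
    return (unsigned)(addr & (chunk - 1));
}
```

For example, with a chunk of 16 the address 0x1237 aligns down to 0x1230 with a misalignment of 7, so the first aligned read covers 7 bytes that lie before the string and must be ignored.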
Loud cries that "the algo is faster instead" are no excuse - the algo should be reliable first, and only then MAY one think about making it faster.
- Of course! Surely Lingo agrees?
I do not know which things he agrees with and, moreover, do not want to know.

But from his "coding style" many of our members have probably got the opinion that he is a megalomaniacal self-named "God of assembly" and a "Code Nazi" who invented everything first on the planet, while all the others stole it from him. He is also known to produce buggy but "fast working" code, and when he gets remarks on that he just starts insulting others, so I think it's better to ask his psychiatrist what he agrees with.

Sometimes I agree with the topic paster's thought that the more descriptive and complete the information one tries to provide, the less respect and careful attention one's descriptions/writings get.
- Very annoying, go to a lot of trouble with the language and then no one reads it! But, minor point: actually, I didn't paste the topic. Couldn't find anywhere on the net to copy it from, so I had to write it myself :P
Well, I think, jokes apart, you understand that I did not mean "copy"-"paster". You "pasted" the topic on the forum: there was no such topic, then it appeared - it was "pasted" there, so the one who did it was the topic-"paster".
The word "paste" is independent of the word "copy" and of the term "copy-paste", isn't it? So that was said pretty correctly :P
Once again, this algo will not fail with ANY zero-terminated string located anywhere in the memory: ...
- Right, but it can be 18 bytes shorter, and a little cleaner:
There are reasons for how it was written. For instance, the XMM and ECX preservation was a "condition of production" - Jochen's condition, since his lib preserves those regs in its internal functions; it's MasmBasic's ABI.
Other things like the "wide" instructions were used to pad the inner loop to align it by 16, yes, but that was for a reason, too: it fit the code as it was written, with all its specifics. The code may be shorter, yes, but as written it seemed (to me, of course, so this may be called a "subjective judgement") the most performant on the widest variety of hardware. Some of our members may confirm that I sometimes like to "hunt for bytes" like a maniac, but in this case it was not the case (pun intended).
I will just comment the changed code in the lines below the changes - without indent - to explain why it was written so.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
mov edx,[esp+4]
; mov eax,[esp+4] ; don't get eax from [esp+4]
mov eax, edx ; but from edx, save 2 bytes
; db 8bh,94h,24h,04,0,0,0 ; don't pad 6 bytes with
; db 8bh,84h,24h,04,0,0,0 ; these two db instructions
The original code will work faster because both loads reference the same memory location, which,
once loaded into the cache (if it was not there yet) by the first access, is fast to read again,
whereas taking the value into one reg and then copying it from that reg into the second one
is a dependent and thus stalling sequence.
If you want to save those two bytes and leave these instructions as they were coded, then just replace
the pxor instructions with xorps.
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
Actually, if I had to decrease the size of the algo, I would probably remove the XMM and ECX preservation,
because it's not the API's ABI but only MasmBasic's ABI.
and edx,-10h ; don't pad 3 bytes
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh ; with this db instruction
This was coded so to put the loop on a 16-byte alignment. Why the code as a whole was written this way can be
understood only if it is seen as a whole and not as separate pieces - read below why it was judged inappropriate
to write the code as it was changed in this example.
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
; jz @F ; no jump save 2 bytes
The CPU "predicts" "forward" jumps as "probably not taken", and this jump is taken only about 1 time in 16
with byte-aligned strings, 1 in 8 with even-aligned strings, or 1 in 4 with dword-aligned strings,
so the jump and its slowdown are more likely not to happen. And even if it happens, the slowdown itself
is not big (the high costs of jumps are really exaggerated, and as the jump is near and actually in the same
code location, it may be assumed that it will not lead to any "misprediction" stalls, since the code
is already fetched and decoded, even the code the jump jumps to (pun again)). So it will cost only a few cycles,
but taking into account that it jumps over some instructions which would take some time, in the real world
(not millions of iterations on the same data, as our testbeds do) this jump will not be a slowdown but will
rather speed the code up.
pcmpeqb xmm7,[edx]
; add edx,10h ; don't add here save 3 bytes
Why the pointer is advanced after the data fetch is described below.
pmovmskb eax,xmm7
shr eax,cl ; works fine if cl is 0
The jump was there not to "protect" against a zero ECX, but to avoid executing unneeded code that slows things down for no reason.
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16 ; still aligned but 16 bytes less
add edx,16 ; switch order of these
pcmpeqb xmm7,[edx] ; two instructions
Advancing the pointer BEFORE fetching the data slows down the code, especially on CPUs older than the modern ones.
mov eax,[edx]; add edx,16 will generally always be faster than add edx,16; mov eax,[edx],
because in the first case the two instructions pair (the memory access goes into one unit and the arithmetic
operation into a second unit), and in the second case they do not pair.
pmovmskb eax,xmm7
test eax,eax
jz @B
bsf eax,eax
sub edx,[esp+4+16+4]
; lea eax,[eax+edx-10h]
add eax, edx ; don't need -10h any more, save 2 bytes
The -10h was needed because the pointer advance was deliberately placed after the data fetch.
@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4
AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
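For readers who find the technique easier to follow outside assembly, here is a rough C sketch of the same idea using SSE2 intrinsics. It is an illustration only, not the MasmBasic routine: the name strlen_sse2_sketch is made up, __builtin_ctz assumes GCC or Clang, and it does no register preservation or loop alignment. Strictly speaking, reading bytes outside the string object is undefined behaviour in C; the sketch relies on the same hardware argument as the assembly, namely that an aligned 16-byte read never leaves the page that holds at least one byte of the string.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; aligned-read strlen sketch, assumes an x86 target. */
static size_t strlen_sse2_sketch(const char *s)
{
    assert(s != NULL);
    uintptr_t addr = (uintptr_t)s;
    unsigned shift = (unsigned)(addr & 15);            /* and ecx,0fh  */
    const __m128i *chunk =
        (const __m128i *)(addr & ~(uintptr_t)15);      /* and edx,-10h */
    const __m128i zero = _mm_setzero_si128();          /* pxor xmm7,xmm7 */

    /* First chunk: compare 16 aligned bytes with zero, then shift out
       the mask bits that belong to bytes located before the string. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128(chunk), zero));
    mask >>= shift;                                    /* shr eax,cl */
    if (mask != 0)
        return (size_t)__builtin_ctz(mask);            /* bsf eax,eax */

    /* Remaining chunks are fully inside or after the string start. */
    for (;;) {
        ++chunk;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128(chunk), zero));
        if (mask != 0)
            return (size_t)((const char *)chunk - s)
                   + (size_t)__builtin_ctz(mask);
    }
}
```

The shift by the misalignment plays the role of `shr eax,cl`: with a misalignment of 0 it is a no-op, so the result is correct either way, and the `jz @F` in the original only skips work rather than guarding correctness.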
- BTW your English is very understandable, much better than I write your language! c u later
Well, it was not so some years ago, but some forum members have also assured me that it is pretty understandable, even if it is pretty "Cyrillic English", as one of the forum friends called it.
Do you write Russian?
Ah, so at least one of the pieces discussed failed. You mean this algo, probably?
Yes, there are many like this around, and here are some more :lol:
Did you point to that site for some reason (maybe there was something useful on that page? I did not read it thoroughly), or just as an example of the fact that there are many implementations which do not care about the possibility of a crash (they do not align the fetch, and even grab the data unaligned all the time)?
It's in 4.asm, and it's the only algo which reads unaligned, so it would crash near the end of the page even if it read in 16-byte chunks; moreover, it does two reads of 32 bytes.
The same also with 4 and 8.
Yes, but I was talking about the SSE one and just did not mention the others.
So the first time, you posted the two-reading piece of this code?
Think I posted it somewhere but I'm not sure.
Just on the previous page there was a piece which included only the inner loop of the two-reading, one-checking algo. My first post here came shortly after that, where I first said that the problem of the algo is in the two reads and one check. Since the first time you provided only the inner loop and not the full algo, I assumed you do the reads unaligned / aligned only to 16.
Ah, I got it earlier but did not realize it was the full set of files. Is dz.exe your file manager, DosZip?
Yes, plus assembler and make to simplify the testing. If on the same drive as masm32 it should work out of the box.
OK, good. Assembler is JWasm?
But another question: do you load the algos from binary files? I just did not find your safe 32-byte algo in the listing of timeit.exe.
Yes, for reasons explained here.
Ah, the well-known code location problem. But you might try a simpler solution: define a different segment for every tested algo. This will also allow running algos that are not "code location independent", with no need to relocate them manually in some way. Algos with jump/call tables are such "relocation required" algos, for instance. Or algos which do non-near relative/direct calls/jumps.