You can safely read and write buffers in units larger than a byte.
There is no difference here between the "MEM" and the "STR" functions. Just imagine that your "MEM" function received a wrong byte count to scan/copy/etc: if the buffer is actually smaller than the byte count parameter says, and the buffer lies near the end of a memory page, the "MEM" code will crash just as the "STR" code would.
When you scan memory you can do it safely with larger-than-byte reads/writes; you just have to pay attention to how memory is allocated in x86 paged protected mode (Windows, Linux, Mac and any other protected-mode operating system on x86).
Memory has a granularity of one page, 4096 bytes "at least", so you can always assume that if you were given some address inside a memory page, the full page is available in memory for reading/writing (depending on its protection).
For instance, if you were given the address 23987234h, you can get the base address of the page that contains it by clearing the 12 lowest bits of the address (AND EAX,-4096); for this example the page address is 23987000h. Now you know the start of the page, and you know that the page size is 4096 bytes, so you know how far the buffer extends AT LEAST: 23987000h + 1000h = 23988000h. Within this range (from the given start of the buffer, 23987234h, up to 23988000h) you can freely read the data in chunks of any size; just make sure your reads do not go beyond it. E.g., you can freely start a single-reg XMM read (16 bytes) at 23987ff0h, but you obviously cannot start the same read at 23987ff1h, because the end address of the data grabbed would be 23987ff1h + 10h = 23988001h, which is beyond the page range, and you cannot be sure that the next page is reserved/committed in memory.
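The address arithmetic above can be sketched in C. This is only an illustration under the assumption of 4096-byte pages; the names `PAGE_BYTES`, `page_base`, `bytes_to_page_end`, `read_stays_in_page` and `check_forum_example` are made up for this example and do not come from any library.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, the minimum page size on x86. */
#define PAGE_BYTES 4096u

/* Base address of the page containing addr: clear the 12 lowest bits
   (the C equivalent of AND EAX,-4096). */
static inline uintptr_t page_base(uintptr_t addr) {
    return addr & ~(uintptr_t)(PAGE_BYTES - 1u);
}

/* Number of bytes from addr up to (not including) the next page boundary. */
static inline uintptr_t bytes_to_page_end(uintptr_t addr) {
    return page_base(addr) + PAGE_BYTES - addr;
}

/* Nonzero if a read of `size` bytes starting at addr stays inside the page. */
static inline int read_stays_in_page(uintptr_t addr, uintptr_t size) {
    return size <= bytes_to_page_end(addr);
}

/* The numbers used in the post, written as checks. */
static inline void check_forum_example(void) {
    assert(page_base((uintptr_t)0x23987234u) == (uintptr_t)0x23987000u);
    assert(read_stays_in_page((uintptr_t)0x23987ff0u, 16));   /* ends at 23988000h */
    assert(!read_stays_in_page((uintptr_t)0x23987ff1u, 16));  /* ends at 23988001h */
}
```

A check like `read_stays_in_page` only says the read does not cross into the next page; it says nothing about whether the bytes belong to your buffer logically.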
Just pay attention to the addresses at which you read/write; together with the size of each data grab, that gives you all the info you need to avoid reading beyond the buffer into inaccessible memory.
It is simpler to align the pointers to a power of two: a read of a power-of-two size (up to the page size) from an address aligned to that size never crosses a page boundary. So you can read freely and assume the code will not crash, and it will not crash as long as you check every read against the algo's finish condition before doing the next one.
As an example, imagine a byte-by-byte StrLen algo and an SSE algo that uses a single-reg XMM read. If you pass a properly formed zero-terminated string to them, both algos will work correctly, provided they check the data already read for the finish condition (the zero byte) before they read the next bytes. If both algos are given a malformed string with no zero terminator up to the very end of the page, and there is no accessible area after that, both algos will crash. No difference between the algos.
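A one-read, one-check loop of the kind described here might look like the following C sketch using SSE2 intrinsics. It is not the code from any particular library or poster; `strlen_sse2_sketch` and `lowest_set_bit` are names invented for this example. It assumes an x86 target with SSE2, and it relies on machine-level behaviour: the first aligned read may touch bytes that lie before the string but inside the same 16-byte block, which is outside what the C standard itself guarantees.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Index of the lowest set bit; the caller guarantees mask != 0. */
static inline unsigned lowest_set_bit(unsigned mask) {
    unsigned i = 0;
    while ((mask & 1u) == 0) { mask >>= 1; ++i; }
    return i;
}

/* Hypothetical single-register StrLen: every 16-byte read is aligned,
   and every read is checked for a zero byte before the next read. */
static inline size_t strlen_sse2_sketch(const char *s) {
    assert(s != NULL);
    uintptr_t addr = (uintptr_t)s;
    unsigned misalign = (unsigned)(addr & 15u);
    /* Round down to a 16-byte boundary: an aligned 16-byte block
       never straddles two pages. */
    const __m128i *p = (const __m128i *)(addr - misalign);
    const __m128i zero = _mm_setzero_si128();

    /* First block: drop the mask bits of bytes that precede s. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128(p), zero));
    mask >>= misalign;
    if (mask != 0)
        return lowest_set_bit(mask);

    for (;;) {
        ++p;  /* advance only after the previous block had no terminator */
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128(p), zero));
        if (mask != 0)
            return (size_t)((const char *)p - s) + lowest_set_bit(mask);
    }
}
```

Because the terminator check sits between any two reads, a properly terminated string never causes a read from a block past the one holding the zero byte. Memory checkers that track object bounds may still complain about the first aligned read.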
What is wrong with that example of SSE code which reads with two XMM regs: the code does TWO reads and only ONE check after both of them. So the code may actually get the proper string end (the zero byte) in the first of the two reads, but it pays no attention to that and tries to read the next 16 bytes anyway, even though the proper zero-terminated string already ended in the previous 16 bytes (the first XMM read). So, while reading a PROPER zero-terminated string at a PROPER possible location near the end of the buffer, with no accessible memory beyond it, this code will crash, and that is why it is IMPROPER.
We had a thread about the StrLen XMM algo on the old forum; the main posters were Jochen, Lingo and me, AFAIK, and Lingo provided an algo of this kind: it did two XMM reads, then combined the byte masks of the results into one 32-bit GPR reg, and only after that verified the result. Of course that code was buggy, and it is funny, and a shame (taking his megalomania into account), that Lingo doesn't know such simple and basic things as buffer/memory layouts. The loud cries of "but the algo is faster" are no excuse: the algo should be reliable first, and only then MAY one think about making it faster. So Jochen decided to use a one-read XMM algo in his lib, and that was the simplest and most reliable solution.
It is probably possible to make a complex solution that reads more than one reg, but it would have to check every reg before it has the right to read the next reg from memory, so that complex code would probably not only fail to be faster than the one-reg-reading code, it would be much slower. So Lingo's code was just trash from the point of view of software production: a fancy thing, but not a real working solution. Well, after such a long time I have described my point of view on "Lingo's coding style" and the real "value" of his overvalued "style", in proper English now.
