Faster Memcopy ...

Started by rrr314159, March 03, 2015, 02:40:50 PM


nidud

#30
deleted

dedndave

while that's true, to some extent

you can't avoid checking cases to get alignment   :P
that means some form of conditional branching

if you play with it, i think you'll find that JNZ is faster for reverse branches and JZ is faster for forward branches   :t
you can sometimes design your code to take advantage

you might also try a jump table - haven't tried that, but it might be faster
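
something like this is what i mean - just a rough sketch, untested, where the fixup is skipped by a forward branch when the dest is already aligned

    ; assume ESI = source, EDI = dest, ECX = byte count (at least 4)
        test edi, 3            ; dest already dword-aligned?
        jz   main_loop         ; forward branch - taken when aligned, falls through to the fixup otherwise
    fix_align:
        mov  al, [esi]         ; move single bytes until EDI is aligned
        mov  [edi], al
        inc  esi
        inc  edi
        dec  ecx
        test edi, 3
        jnz  fix_align         ; reverse branch - taken while still unaligned
    main_loop:
        ; dword (or XMM) copy loop goes here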

nidud

#32
deleted

rrr314159

Hi nidud,

This is a very interesting investigation; unfortunately I don't have time to get into it at the moment, but in a couple of days I'd really like to do so. The following comments are (deliberately) without reference to anything but my organic memory subsystem ...

First - since we're getting down to details - do a global search and replace on my algo, changing "ebx" to "edx". Of course, it has no effect on the timing.

Now, your jump timings are right in effect - that is, the conclusion drawn is right: it's better to skip that branch. I noticed you did that but didn't bother to change my code, thinking it was meaningless; but it seems it's more significant. I say "in effect" because in actuality jmps take far fewer cycles than the numbers you give: 13.5 and, uh (alright, I'll take a look ...) 30.1, 10.7 etc. Yet those are reasonable Expected Values, because of branch prediction. The reason for avoiding jumps is not that they're inherently that long; rather, if a branch is incorrectly predicted, it costs many cycles while the pipeline is cleared and reloaded. So you can't come up with exact numbers without many more details. It's not only different for different companies (as you mention, AMD and Intel): it's different for the same processor, depending on recent history - even the very first time the branch is encountered - because other nearby branches can affect the prediction for this branch. Thus a detailed analysis is inherently probabilistic. As I say, the numbers you give look like good EVs (presumably you got them from some ref that did such an analysis, or from a performance analyzer?) and they give the correct conclusion: the branch shouldn't be there.
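
To illustrate with made-up but plausible numbers: say a correctly predicted jump costs ~1 cycle and a mispredicted one ~20 (pipeline flush). Then

    E[cost] = 0.9 * 1 + 0.1 * 20 =  2.9 cycles   at a 10% miss rate
    E[cost] = 0.4 * 1 + 0.6 * 20 = 12.4 cycles   at a 60% miss rate

which is exactly the kind of spread your 10.7 / 13.5 / 30.1 figures look like: Expected Values over some assumed misprediction rates, not fixed instruction latencies.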

I only glanced through your algo (mem3), but here's what I noticed. There was a clever substitute for the initial pushes which saved - either a byte or a cycle, I forget which. The comparison tests were different; mine, I felt, were more intuitive. Other small details amount either to nothing or not much: a couple of slightly better techniques which I didn't bother to borrow (like the push substitute; next time I'll use them), and a couple of places where I think my technique is a little better. Half the algo I didn't read, but I presume there's nothing too important there. But I removed almost all alignments (from your modified version of my algo); my tests showed they were basically insignificant (maybe a cycle or two), and I wanted to reduce the code size, enhancing the cache-fit probability by half a percent, if that.

My point is this. The differences between the two algos are minimal (to my knowledge); but now you're showing that your branch elimination saves quite a few cycles, more than I had supposed. So why, then, is my algo - more accurately, my version of the same algo, since these diffs are virtually just a matter of style - why is it faster? (by a small, but significant, amount?) Since you, apparently, have gone through them with a fine-tooth comb, presumably you must know. There's something about mem4, versus mem2, that's saving more than just a cycle or two - it must be quite noticeable - so, what is it?

I'm not kidding about a couple days: happy to investigate this further with you when I have more time.
I am NaN ;)

nidud

#34
deleted

rrr314159

@nidud, Thanks for providing your test software, in future I'll use some of those (better) techniques.

But I think I've answered my own question: why my algo does a bit better when timed my way instead of yours. I'm using the standard approach: not loading .bin algos or FlushInstructionCache'ing. (These are good things to do; thanks for the tip.) But doing it the usual way, when the dest is aligned, the branch (around the aligning instructions) is hit many thousands of times and *always* taken. So the processor predicts the branch perfectly, and that jump costs almost nothing - a couple of cycles (not sure exactly). Whereas you're executing six instructions there (13.5 cycles). This matches my results: my algo's a bit ahead on ALIGNED data, but a bit behind on non-aligned, since not taking the jump costs a little. Conclusion:

- rrr's rule for not-very-careful timing test beds: conditional branches (that depend in a simple way on the inputs) are almost free, so use as many as you want!

Reminds me of something hutch said, "I have always been an advocate of real time testing ...". In real life the environments will be very different, and variable, you don't know what you'll run into, so:

- rrr's timing statistical rule of thumb: ignore differences smaller than about 5%.

I think you should update your scoring routine to allow ties!

A more important thought re. branch prediction occurred to me. When a utility routine like memory-copy is called in real life, there's a decent chance the branch to it wasn't predicted right, so the pipeline may well be empty (other factors can cause the same thing). So, always put a (necessary) branch right at the beginning if possible, where a misprediction might be almost free.
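
For instance - a toy version just to show the placement, not a fast algo - the length test can be the very first thing the routine does:

    memcopy proc uses esi edi src:DWORD, dst:DWORD, len:DWORD
        mov  esi, src
        mov  edi, dst
        mov  ecx, len
        cmp  ecx, 16
        jb   tail              ; the (necessary) branch, right at the entry,
                               ; where a mispredict is nearly free anyway
        mov  edx, ecx
        shr  ecx, 2
        rep  movsd             ; dword copy for the bulk
        mov  ecx, edx
        and  ecx, 3            ; 0..3 leftover bytes
    tail:
        rep  movsb
        ret
    memcopy endp
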
Quote from: nidud
However, most new machines (and almost all Intel CPUs) work best on aligned loops.

I'm sure you're right, but I saw approximately no timing difference with "align x". At least in one important case, it turned out the loop was (accidentally) already aligned. But even when not, no real difference: dunno why.

Quote from: nidud
If you think about it, the code above (and below) the loop label is peanuts compared to what happens in the main loop, so that's the main priority. The init and exit code will only hit you when multiple calls are made with small lengths.

Your first sentence is true when called with large blocks. But your second sentence explains why the code above (or below) the loop label *can be* important: your user may indeed be using it for lots of small lengths. That's why, if one can save a cycle or two (for instance, by the placement of the "toend" code), it should be saved on the small-length path rather than the long-length path.

Finally, something you should know: if your routines are called to copy <=15 bytes when PAGE_NOACCESS protected memory happens to lie just beyond the source buffer, they'll blow up :P
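
To see it blow up, park the source flush against a guard page - something like this (untested):

    .data?
    oldProt dd ?
    .code
        invoke VirtualAlloc, NULL, 8192, MEM_RESERVE or MEM_COMMIT, PAGE_READWRITE
        mov    esi, eax                ; two committed pages
        add    esi, 4096
        invoke VirtualProtect, esi, 4096, PAGE_NOACCESS, addr oldProt
        sub    esi, 15                 ; 15-byte source, flush against the no-access page
        ; now call the copy routine with ESI as source and len = 15 -
        ; any 16-byte read starting at ESI faults in the PAGE_NOACCESS page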

c u later,
I am NaN ;)

hutch--

I have also learnt a few lessons on code alignment. Regularly, while bashing the guts out of an algo to get its pace up, I have tried aligning labels that get hit hard and seen the timing take a lot longer. Much the same with procedure alignment: align at 4 bytes and it works OK; change it to 8 or 16 bytes and it often slows right down. Then it's worth shifting the location of the algo around in the exe, to make sure you are not deluding yourself, and seeing what the results are.

I have tried spacing algos from each other with padding from under 1k to over 1 megabyte, and that was more or less useless, so I usually settle on a delay of about 500 ms between each algo, and set the priority class to high, but not the one above it. As hardware gets smarter, timings wander even further.  :biggrin:
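
The spacing looks something like this - just a sketch from memory, with algo1_timing/algo2_timing standing in for whatever you are testing:

        invoke GetCurrentProcess
        invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS  ; high, but not REALTIME
        invoke Sleep, 500              ; settle before the first algo
        call   algo1_timing
        invoke Sleep, 500              ; ~500 ms between each algo
        call   algo2_timing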

rrr314159

Quote from: hutch
As hardware gets smarter, timings wander even further.

Very true, meanwhile I'm getting slower ... but, I'm reminded of something hutch once said,

Quote from: hutch
Don't give up on timing, it's good for you, and as an aside you tend to pop faster algos if you do it regularly. You get faster at it over time, and with more practice you tune the timing to the task the algo will perform a lot better.

Also true! One good thing: it gives the lie to the C programmer's favorite saying, "the compiler does it better than you". Unless the compiler writers are down in the trenches timing away - as opposed to just believing Intel/AMD/Agner, as I think they do - they can't match a real MASMologist.
I am NaN ;)

nidud

#38
deleted

nidud

#39
deleted


nidud

#41
deleted

Antariy

You may safely read/write buffers in more than byte-by-byte fashion.

There is no difference between the "MEM" and "STR" functions here. Just imagine your "MEM" function receives the wrong byte count to scan/copy/etc.: if the buffer is actually smaller than the byte count param specifies, and the buffer lies near the end of a memory page, the "MEM" code will crash just like the "STR" code.

When you scan memory you can do so safely with larger-than-byte reads/writes; you just have to pay attention to how memory is allocated under x86 page-addressed protected mode (Windows, Linux, Mac and any other protected-mode operating system on x86).
Memory has a granularity of one page, 4096 bytes at least, so you can always assume that if you are given some address inside a memory page, the full page is available in memory for read/write (depending on protection).
I.e., if you are given the address 23987234h, just for instance, you can take the base address of the page within which this address lies by clearing the 12 lowest bits of the address (AND EAX,-4096): for this example it is 23987000h. So now you know the start of the page, and you know the page size is 4096 bytes, so you know how far the buffer extends AT LEAST: 23987000h + 1000h = 23988000h. Within this range (from the given start of the buffer, 23987234h, up to 23988000h) you may freely read the data in accesses of any size; just make sure a read does not extend beyond that. I.e. you may freely start a single-reg XMM read (16 bytes) at 23987ff0h, but you obviously may not start the same read at 23987ff1h, because the final address of the data grabbed is 23987ff1h + 10h = 23988001h, which is beyond the page range, and you cannot be sure the next page is reserved/committed in memory.
Just pay attention to the addresses at which you read/write and to the size of each access, and you have all the information you need to avoid reading beyond the buffer into inaccessible memory.
It is simpler to align the pointers to a power of two; then you may freely read and assume the code will not crash - and it will not crash, provided you check every read against the algo's termination condition.
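
In code, the page math above looks something like this (a sketch; EAX is assumed to hold the address you were given):

        mov  edx, eax
        and  edx, -4096        ; EDX = base of the page containing EAX
        add  edx, 4096         ; EDX = first byte of the NEXT page
        sub  edx, eax          ; EDX = bytes guaranteed readable from EAX
        cmp  edx, 16
        jb   careful_path      ; fewer than 16 bytes left in the page -
                               ; do not start a 16-byte XMM read here
        movdqu xmm0, [eax]     ; safe: the whole read stays inside the page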

As an example, imagine a byte-by-byte StrLen algo and an SSE algo that uses a single-reg XMM read. If you pass a properly formatted zero-terminated string to the algos, both work correctly, provided they check the data they just read for the finish condition (the zero byte) before reading the next bytes. If both algos are given a wrongly formatted string with no zero terminator before the very end of the page, and no accessible area after that, both will crash. No difference between the algos.
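
The safe single-read pattern looks something like this - a sketch, untested, assuming for brevity that ESI already holds a 16-byte aligned string pointer, so every MOVDQA stays inside one page:

        pxor    xmm1, xmm1           ; zero bytes to compare against
        mov     eax, -16
    next_block:
        add     eax, 16
        movdqa  xmm0, [esi+eax]      ; ONE 16-byte read
        pcmpeqb xmm0, xmm1           ; 0FFh where a byte is zero
        pmovmskb edx, xmm0
        test    edx, edx
        jz      next_block           ; no terminator in THIS block - only now
                                     ; is it safe to read the next block
        bsf     edx, edx             ; offset of the zero byte within the block
        add     eax, edx             ; EAX = length of the string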

What is wrong with that example of SSE code which reads with two XMM regs is this: it does TWO reads and only ONE check, after both reads. So the code may actually find the proper string end (the zero byte) in the first of the two reads, but it will still, paying no attention to that, try to read the next 16 bytes, even though the properly zero-terminated string already ended in the previous 16 bytes (the first XMM read). So, when reading a PROPER zero-terminated string at a PROPER, perfectly possible location near the end of a buffer with no accessible memory beyond it, this code will crash; that is why it is IMPROPER.

We had a thread about a StrLen XMM algo on the old forum; the main posters were I, Jochen and Lingo, AFAIK, and Lingo provided an algo of exactly this kind: it did two XMM reads, combined the byte-mask results into one 32-bit GPR, and only after that verified the result. Of course that code was buggy, and it is funny, and a shame (taking his megalomania into account), that Lingo doesn't know such simple and basic things as buffer/memory layout. Loud cries that "the algo is faster instead" are no excuse: an algo must be reliable first, and only then MAY one think about making it faster. So Jochen decided to use a one-read XMM algo in his lib, and that was the simplest and most reliable solution. It is probably possible to write a complex solution that reads more than one reg, but it would have to check every reg before it has the right to read the next reg from memory, so that complex code would probably not only be no faster than the one-reg-reading code, it would be much slower. So Lingo's code was just trash from the point of view of software production: a real fancy thing, but not a real working solution. Well, after such a long time I have described my point of view on "Lingo's coding style" and the real "value" of his over-valued "style", in proper English now :greensml:

Antariy

Intel's "read ahead" is not actually a real read-ahead. Just show us the full code, and it is certain they verify the finish condition (the zero byte) before reading any data that might lie beyond the page range. A real read-ahead only makes sense when the code reads non-dword-aligned strings; but considering that the code probably aligns the pointer first (which is not obvious from the small piece of code provided), and/or that every "advanced" compiler aligns strings on a dword boundary, this is not dangerous "read-aheading" at all. It has nothing in common with an algo that does two (or more) reads before it checks the termination condition. If an algo, using whatever technique, still takes care to stay within the page range, it will not crash. All the other differences are details of implementation, not of the idea itself.

nidud

#44
deleted