Research on the algos design

Started by Antariy, March 24, 2015, 06:42:55 AM


nidud

#30
deleted

Antariy

rrr,

Quote from: rrr314159 on March 22, 2015, 06:27:42 PM
Antariy,

Thanks very much for your comprehensive answer but I hate to make you type so much! No need to spell everything out in such detail. I see you have strong feelings about "protected programming" and Lingo so will leave those topics alone! As for "topic paster", it was just a joke, no worries.

Well, my feelings about those topics are just my opinion. "Protected programming" is not a panacea, since the program ultimately runs on an imperfect OS, which in turn has its own bugs and problems. As for Lingo - I have no feelings about that person at all; in my opinion he has earned the "name" not of a "great coder" but of an immature and uncultured person with an exaggerated self-opinion.
As for "topic paster", I just wanted to make clear that it was not meant as a slighting word.


Quote from: rrr314159 on March 22, 2015, 06:27:42 PM
The important thing is the code. I understand your reasoning (of course) but it offends my aesthetic sensibility to see all those bytes wasted! So, considering your suggestions, I made some modifications, and tested the shorter algo against yours as well as I could. Bottom line, it seems the shorter algo is, for the most part, faster.

Well, those bytes may be "wasted" on some machines and "useful" on others - the only difference is where the code ends up running. This is similar to other code that was actively discussed here - the hex2dword algo. My initial code there was sloppy, but it implemented the idea of unified processing of the hex string - no special handling of the letters, one path with no checking and no correcting of the letters to some "hardcoded" case (upper or lower). It worked well, but with static tests (just like the ones we used with StrLen here) it was slow, and I was drawn into a discussion which took some days and many tries to make the algo run "faster" on modern CPUs in the "static" test. That was hard to do, because none of the "optimizations" I made changed the speed on my own CPU, so I was optimizing "by touch" and by analysis. In the end the algo became "the fastest" (it was one cycle faster than the previous "winner" on Core2+ CPUs :greensml: And it was ~2 times faster on my CPU - but it was also faster than the previous "winner" before the optimisations, so that was done just for the sake of the challenge ::)) on every (!) CPU it was tested on there - but it cost a lot of time and tries, with, actually, no reason, because dumb unrolled code will always be "faster". I tested that here probably for the first time since registering on the forum, and after finding that unrolled algos will for sure be faster even in real world conditions, I chose to use one instead of a short looped algo.

But why am I telling you this? Because the situation is very similar now, and because the algo of mine which was the "winner" among the looped algos in that hex2dword challenge also used a lot of those "wasted bytes" to do its task. The algo was the fastest only when written in that "bloated" manner: when you make the instructions smaller and, instead of pre-checking the starting condition before the loop, just jump into the loop, the algo was no longer the "winner" - it was slower by a couple of cycles, and that mattered there because the whole discussion was in the "who is the fastest?" manner. So the code there was probably 30% bigger than it had to be, just because of the "padding long instructions".

So, if I want to take the most from some piece of code, especially from the short-cuts like the code entry and the prechecks before the inner loop, I do not hesitate to use longer instructions: they may save a few cycles when the code takes a short-cut, and since the timings are small there with any algo, those saved cycles may be useful to "show" better timings ::) Yes, as you may notice, the timings become an "idol" in these tests. People look at them, respect them, compare and align their code with them, so aligning the code to the timings is probably not the best thing, because in the end it leads to patterned code design.

The interesting point is: my hex2dword looped code was not pattern-like in its idea, and though the idea was initially implemented "slowly", after the "contest" it was the fastest - still the same code with the same idea - and, moreover, it works on any 32 bit CPU starting from the i386, while the previous "winner" was P6+ only code (due to CMOV). The better idea, which avoids conditional processing (jumps or moves) altogether, beat the patterned idea both in speed and in the range of CPUs it can run on. But how much time did it take to crazily optimize the code, to make it longer and more bloated (still, we are talking about less than 80 bytes), just to win the "contest"? It's a useless thing to do in an everyday manner - only if you are in that mood may you want to do it.

That's why I just used the longer instructions to align the loop, without many other tests with other instruction lengths - on every CPU it will run differently, a bit faster on one and a bit slower on another, and in the end it will run in real world conditions where all algos go "head to head" even when they are crazily complex and sophisticated. I just wrote the code with a good idea in it - you may notice that my 16 byte alignment handling at the code entry is even smaller and simpler than Intel's in the Silvermont code. That was the first step of the algo, and it was simple and elegant; the next step was simple processing in the aligned inner loop, with aligned data. The loop itself may not be aligned very elegantly, with all those bytes, but the data alignment is elegant, so the code is elegant in itself. I agree that it may not look "aesthetic", but look at it as simple code which does its job well. Yes, it may be modified and tweaked in a bunch of ways to run better on some specific CPU, but, as I said, it was written to perform "well" on every CPU - not to perform "the best" on some family or even model of CPU. And it's simple, yes, maybe a bit "sloppy"; yes, it may be optimized to be smaller/faster, but that was not the aim, for the reasons written in the "Research" thread.


Quote from: rrr314159 on March 22, 2015, 06:27:42 PM
First, to avoid register stall with these instructions: mov edx, [esp+4] / mov eax, edx: I just moved the mov down 2 instructions.

I didn't know xorps could save a byte over pxor, thank you. I used those 2 bytes to put the jump back in. It was necessary for the AMD which is horribly slow on bsf. I jump into the loop, skipping the "add edx, 16", so the logic remains valid.

Still preserving XMM and ECX.

Yes I know the purpose of the db instructions is to pad 9 extra bytes to align the beginning of the loop. I know that's better than nop's or lea eax, [eax] which waste time as well as space. But surely it's even better to not waste the bytes, as long as routine is still as fast or faster.

CPU branch prediction - u know, behavior seems to change with every new release from Intel or AMD. Often, we optimize a routine on our own machines, but on newer/older machines it may behave different. I routinely optimize on my Intel I5, then try it on AMD A6 and Pentium 4; often fastest on one machine may be slowest on another. So I'm leery of artificial coding techniques for speed.

Now, I had thought you were right: pointer advance should go AFTER data fetching. However on the Intel my loop was faster. On AMD, a little slower. Apparently the order of the two instructions makes little difference. Anyway, there's very simple correction available, if needed. Just "pcmpeqb xmm7, [edx+10h]" first, then "add edx, 16" - uses one more byte.

By far, the hardest part of the whole exercise is not writing the routine, but getting semi-reliable timing! First, I used your suggestion and put all algos in separate segments. Then, I doubled both of them; put yours first and last, mine 2nd and 3rd. It appears the "last" position is slightly favorable. Then, in desperation, I copied/pasted the timing runs a number of times, using just the final numbers.

Well, that will probably avoid the stall, yes, depending on how many execution units the CPU core has and how independent the instructions further down are. I.e., by moving the mov down (the pun), you moved it closer to the next use of the reg you moved into, so by removing the stall above you may be provoking one below - the entry code is very compact and simply has no extra room to spread the instructions over.

Also, you may use movaps/movups instead of movdqa/movdqu - it's the same thing, 1 byte shorter. Just a note, since the code already uses movups.

Yes, I agree: if the routine is smaller and consistently faster than the previous version on every CPU, then it's better to use the shorter version. As I said above, I just didn't play with that much.

CPU branches were always "predicted" as "not taken" for a jump forward and as "taken" for a jump back, and this logic will probably not change in any CPU, because it matches the simple logic of loops: you do the work, then you check the condition, then you jump back to the start of the loop. This is more efficient speed-wise; yes, it is less efficient code-size-wise, since you first have to check the condition before you enter the loop, but that was probably considered a fair trade for a faster exit from the loop and faster work inside it.

The CPU "sees" that the (especially conditional) jump goes backwards, so it "knows" that this is a loop and that the jump is almost always more likely to be taken than not - loops usually execute more than once, they may execute thousands or millions of times, so if the CPU "mispredicted" the loop jump on every iteration it would cost a great many cycles. So the CPU just treats the backward conditional jump as an "almost unconditional jump", and only one misprediction occurs when the loop ends - one small stall after thousands of correctly predicted jumps (the pun again). So a backward conditional jump, when it is taken, costs almost nothing in the timings; when it is not taken (the final iteration) it costs more clocks, since it is "mispredicted". The same with forward jumps: the CPU "thinks" they will not be taken, and if a forward jump actually is taken, that costs some clocks because of the "misprediction"; if it is not taken, the Jcc instruction costs almost nothing to execute.

You may notice this technique widely used, for instance in the heavily unrolled Intel Silvermont algo, where the code has many forward conditional jumps - those jumps cost almost nothing, and the one of them which finally is taken will cost some cycles, but after that jump the code finishes its work, so it doesn't matter for the deal. The code saves cycles on all the not-taken forward jumps; after executing some of those almost-free jumps it has already "prepaid" the stall of the one taken forward jump, and since it finishes (or gets close to finishing) right after it, that one "misprediction" is nothing to care about.
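A minimal sketch of the loop shape described above - the label names and the count register are only illustrative, not taken from any algo in this thread:

    mov  ecx, count      ; iteration count, illustrative
    test ecx, ecx
    jz   done            ; forward conditional jump: predicted "not taken", cheap while it isn't
loop_top:
    ; ... the work of one iteration ...
    dec  ecx
    jnz  loop_top        ; backward conditional jump: predicted "taken",
                         ; mispredicts only once, on the final iteration
done:

The pre-check before the loop costs a couple of bytes, but it lets the backward jump carry all the iterations with near-zero prediction cost.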

I think that advancing the pointer before or after the fetch is a very debatable thing, because it depends on the implementation of each particular CPU family and even model - on some there is no (noticeable) slowdown when you put the advance before the fetch, on others the slowdown is very noticeable, so it's a matter of choice, probably.

Yes, it's hard to get reliable timings - that's why we use the testbed in The Laboratory with millions of repeats, to get more or less stable timings.
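For illustration only, a bare-bones version of that idea - this is not the actual Laboratory testbed; szTest, algo2 and the repeat count are just placeholders:

    mov     esi, 1000000          ; repeat count, illustrative
    rdtsc
    mov     ebx, eax              ; keep the low dword of the start count
again:
    push    offset szTest         ; the test string
    call    algo2                 ; the routine being timed
    dec     esi
    jnz     again
    rdtsc
    sub     eax, ebx              ; total cycles over all repeats (low dword)
    ; divide by the repeat count to get the average cycles per call

The real testbed also pins the thread to one core, raises the priority and discards outliers, which is why single handmade loops like this still jitter.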


Quote from: rrr314159 on March 22, 2015, 06:27:42 PM
Your routine was a bit better on Intel ALIGNED 271; also slightly better on the longer strings on AMD. Everywhere else, the shorter routine is better. It's dramatically better on AMD short strings, who knows why; and better on Intel long strings. BTW all these numbers came out pretty much like this on multiple tests; I'm only showing one typical run from each machine.

Here is my modified routine:

; «««««««««««««««««««««««««««««««««««««««««««««««««««
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo - number 2
    mov eax,[esp+4]
    add esp,-14h
    movups [esp],xmm7
    mov edx, eax
    mov [esp+16],ecx
    and edx,-10h
    xorps xmm7,xmm7
    mov ecx,eax
    and ecx,0fh
    jz intoloop
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    shr eax,cl
    bsf eax,eax
    jnz @ret
    xorps xmm7,xmm7
    @@:                       ; naturally aligned to 16
        add edx,16
    intoloop:
        pcmpeqb xmm7,[edx]
        pmovmskb eax,xmm7
        test eax,eax
        jz @B
    bsf eax,eax
    sub edx,[esp+4+16+4]
    add eax, edx
   
    @ret:
    movups xmm7,[esp]
    mov ecx,[esp+16]
    add esp,14h
    ret 4
algo2 endp
end_algo2:


Bottom line, I can't believe it's right to waste those 18 bytes.

Well, as I said, if you can make it smaller and you see that it's reliably faster, then that is the proper thing to do; I did not argue with that, and I am not trying to prove that all those bytes MUST be there :biggrin: Also, you may avoid saving XMM and ECX - that will save you 3-5 cycles too, which is especially important with short strings, the case where the speed of an algo is a matter of the speed of its entry and exit and not of its inner loop. Here is another tweak of the algo; I did not test it, though. But in the real world it will perform pretty well. There are some rearrangements in the data fetch and the entry code - since saving XMM and ECX is avoided, there is no need to reuse the same XMM reg.


algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo and then again modified by Antariy :) The chain of modifications :)
    mov edx,[esp+4]
    mov ecx,[esp+4]
    and edx,-10h
    xorps xmm7,xmm7
    sub ecx,edx
    ;movaps xmm6,[edx]
    db 0Fh,28h,72h,0
    pcmpeqb xmm6,xmm7
    pmovmskb eax,xmm6
    shr eax,cl
    jnz @ret
    @@:
        movaps xmm6,[edx]
        pcmpeqb xmm6,xmm7
        add edx,16
        pmovmskb eax,xmm6
        test eax,eax
        jz @B

    sub edx,[esp+4]
    bsf ax,ax
    add eax,edx
    ret 4
   
    @ret:
    bsf ax,ax
    ret 4
algo2 endp



It will probably be 68 bytes, and the loop should still be aligned at 16 with only one byte of padding, hidden in the longer-than-required instruction :biggrin:

Quote from: rrr314159 on March 22, 2015, 06:27:42 PM
Finally, of course I can write Russian! I'll do it again: "Russian". - very easy. OTOH I can't write in Cyrillic to save my life :)

:biggrin:

Antariy

Quote from: rrr314159 on March 24, 2015, 12:01:19 PM
There's a problem with this whole exercise: same code runs very different on different processors.

Quote from: nidud
but this seems to faster for some reason

I do the same thing: try different combinations and select the fastest on my machine, often not knowing why it's faster. Problem is, what happens on another machine?

Well, it runs differently, but not totally differently, and, talking about real world conditions, you may be sure that code which gets close to the bottleneck - the memory bus bandwidth in this case - will run well on every machine.

What points do we have? We have an algo which accesses every byte of the string sequentially, and does this just once per call. So, as was written in the "Research" thread, you may assume that the algo will work at the maximum speed the memory bus provides - no matter how complicated the code is, once it has hit that bottleneck it will sit there in real world usage. Just as an example: DDR1 memory has a "wire width" of 64 bits, 8 bytes. With an effective data rate of 400 MHz this gives 400,000,000 * 8 = 3,200,000,000 bytes per second (that's why the DDR400 standard is also referred to as PC3200), so a bit more than 3 GB per second. At the same time the CPU has access to its internal memory - the cache - at more than 20 GB per second. Even with old CPUs like the PIV we have seen ~30 GB/s data rates for algos tested in The Laboratory (that was, for instance, the line count algo, tested with a dataset fitting into the cache). 30 GB/s - unbelievable, yes? Compare it with 3 GB/s. So, actually, you may expect that 9/10 of the algo's work will be spent waiting for the memory bus to deliver data to the ALU.
Another point: the DDR "wire" width is 8 bytes, but one XMM move takes 16 bytes, so one (!) XMM move takes at least two transactions on the memory bus. And it gets even more complicated when you notice that the CPU doesn't fill the cache with just the 16 bytes that were asked for - it fills an entire 32/64 byte cache line. That's why code which accesses data sequentially is more performant than code which accesses data in a random fashion: on a miss, the line fill starts at the address with the lowest 5/6 bits cleared, and the code waits not only for the data it asked for but for the entire cache line to be filled from memory, i.e. 32/64 bytes (at least) are transferred from RAM into the cache line. So code which accesses "a byte here and a byte there" - not sequentially - causes, with each access, the transfer of an entire cache line, even if it only touched one byte.
From this point of view, the only "most effective" access to memory is sequential access in 32/64 byte chunks (this is not a constant; it depends on the CPU generation and may differ). All other access patterns are "less effective", but only in theory, because in the end the code is stuck with the 16-byte-wide XMM regs, which is the maximum a 32 bit CPU's ALU can handle at a time. If we had 64 byte regs, code using them would have the most efficient correlation between 64 byte aligned memory transfers and ALU operations.

The other point which you may have seen and wondered about: why can pcmpeqb xmm0,[ebx] be slower (!) than movaps xmm0,[ebx] followed by pcmpeqb xmm0,xmm7 (with xmm7 zeroed)? That is because of ineffective transactions between the 16 byte regs and the 8 byte data bus: for some reason the direct comparison against memory does not handle that transaction very well, while the plain move is better optimized in hardware. It's the same kind of thing as mov eax,[somevalue] / push eax being faster than just push [somevalue]. It may look strange, yes, but those are hardware implementation specifics.
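A minimal sketch of the two variants being compared, assuming xmm7 already holds zero, edx holds a 16-byte aligned pointer, and (for variant 1) xmm0 has been zeroed beforehand:

    ; variant 1: compare straight against memory - one instruction shorter,
    ; but the load folded into pcmpeqb may go through the slower path
    pcmpeqb  xmm0, [edx]

    ; variant 2: plain load first, then a reg-reg compare - one instruction
    ; more, but the simple move tends to be better handled by the hardware
    movaps   xmm0, [edx]
    pcmpeqb  xmm0, xmm7

Which one wins is exactly the model-specific question discussed above; both forms compute the same per-byte equality mask.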


We do the optimisations, but in the end we are behind the hardware implementations, and that is the only reason why the algos perform so differently. But then you may decide that it's better to write code which performs equally well on every machine, rather than code which is insanely optimized for a single type of machine. We are not talking about industrial programming, where you fit your code to the hardware, but about code which should work on a "common machine".
So, thinking so (pun), you write "code which is enough to do its task" and which is sophisticated enough just to hit the hardware limitations. In this respect ANY more or less well written SSE code will run much faster with data sets which fit into the cache. So, now you have all the info to design your code, depending on what it will do, what amount of data it will process - fitting into the cache or not - and in which manner it will do so: repeating "number crunching" loops over the entire data again and again before it produces a result, or looping through the entire data just once (like StrLen does) without doing any heavy numerical processing.
So now you are sure that a GPR strlen is not suitable for you, because it is too ineffective with regard to memory throughput as well as ALU operations on the smaller data size. You know you need code which processes data in wider chunks and interacts with memory more effectively, and you need the code to run on the vast majority of machines all over the world - so you choose SSE2 code. Then you think about how the code will process the data: will it make many passes of an inner loop over the data to produce the result? If not, then you may forget about the word "cache" with all its help: there is no effective interaction with memory, there are no cache-like speeds, there is just the maximum the system can provide - the RAM speed. And SSE code will be faster than the RAM speed, so even sloppy SSE code will beat (if it's not toooooo sloppy) the best possible GPR code. So there is no need to write the algo a mile long and bloated - it will not achieve much besides impressing the public with its size.



Quote from: rrr314159 on March 24, 2015, 12:01:19 PM
I put together a new "AxStrLenSSE" which beats the Silvermont on my i5. It was easy. First, notice that Antariy's original algo saves XMM and ECX, because of environment it was designed for; but Silvermont doesn't. So I took that out. Then, it was still slower on the long strings. So, I unrolled the loop 9 times, and it easily "won" on nidud's test bed. But then I tried it on my AMD, and it came in 3rd (SSE 32 also ahead)! (AMD is very bad for bsf, pcmeqpb and pmovmskb.)

...

Anyway, here's this new version of Antariy's algo, which is better on i5's than Silvermont, but worse on AMD ... I know there are various ways to improve it, but I think I'll do something else instead!

Yes, for sure you will optimize it as well as possible for your machine.

As for AMD - try to use as many different XMM regs as possible, and do pmovmskb into as many different GPR regs as possible. AMD tends to perform better with code which is "truly independent"; probably its register renaming mechanism is weaker than Intel's, though when you write unrolled and really independent (in terms of register names) code, it will probably perform even faster than Intel's. But this is just a note - I do not claim that this will work for every existing AMD CPU.
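Not tested on any particular AMD model - just a sketch of what "truly independent" means here: each 16-byte block gets its own XMM reg and its own GPR for the mask, so no instruction waits on a register written by the previous block (xmm7 is assumed zeroed, edx 16-byte aligned, and ebx/esi/edi assumed free):

    movaps   xmm0, [edx]
    movaps   xmm1, [edx+16]
    movaps   xmm2, [edx+32]
    movaps   xmm3, [edx+48]
    pcmpeqb  xmm0, xmm7
    pcmpeqb  xmm1, xmm7
    pcmpeqb  xmm2, xmm7
    pcmpeqb  xmm3, xmm7
    pmovmskb eax, xmm0
    pmovmskb ebx, xmm1
    pmovmskb esi, xmm2
    pmovmskb edi, xmm3
    ; ... combine/test the four masks, then advance edx by 64 ...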

Quote from: rrr314159 on March 24, 2015, 12:01:19 PM
Working at Intel, Torbjorn Granlund has access to hundreds of machines (probably) and can try his Silvermont algo on all of them. If you don't have similar facilities there's no point in trying to keep up with him. I also think it's ridiculous to use 700 bytes for a strLen algo. This is one reason Microsoft and other modern code is so bloated.

Just the same, Torbjorn might have some good ideas. There is only one thing he does I've never seen before, uses pminub to take care of the situation you (Antariy and nidud) are discussing right now. That is, integrating the results from 2 or more XMM's reading the string. Perhaps you should take a look at his technique.


Well, the place where he works doesn't mean he produces the best possible code. Again, the tests which "prove" it to be the fastest possible are purely synthetic. In the real world there is no need for almost a kilobyte of code which will "do well" only if the string it was pointed at WAS ALREADY entirely in the cache at the time you called the code. Being part of an application running under a multitasking OS, that code may forget the word "cache", and on a real machine in a real app it will, for sure, execute no faster than other SSE strlens which are less than 100 bytes long. Moreover, not only does the data cache get flooded with other data, the code cache, which is smaller, gets flooded many thousands of times per second, so, when executed outside an exclusive environment where a single thread of one app owns all the CPU core time, the longer code will also stumble over decoding stalls each time it gets called. Marketing stuff like that is nothing more than marketing stuff - just a synthetic test, like the ones we use in The Laboratory with millions of calls to the same code with the same short data.
From whichever angle you look at it, you will find that code like that is not really useful in a real world app. Probably it's something relating to Freud :greensml: Big guns, big cars, big ships, big code :greensml: People like that - the bigger, the better, they think - just the psychology of mankind.

Here we have a wide set of machines - from old ones to the newest ones. Here I agree with nidud: on the forum we have a pretty wide set of machines to test on and to collect optimisation info from.

The Silvermont code is pretty logical and simple - just a heavily unrolled check until the buffer gets aligned to 64, then a 64 byte aligned loop, and a bunch of possible return conditions handled at different return points. There is nothing interesting or new in that. But pminub as a method of collecting the lowest bytes while searching for zero is interesting, yes, though I'm not sure it is fast on every machine: comparison with zero is a simple operation which may have shortcuts in the circuit, but a comparison which actually processes the value of the data, followed by a rearrangement of the bytes in the regs, may be not so fast. So I'm not sure that, even with strings located in the cache, the 64-byte fetching loop will perform (much) faster than the loop I used in the big algo in the "Research" thread. The idea is interesting and needs to be tested - but then the big algo in that thread is just a prototype and is not actually finished; I was too lazy to finish it, and would just have had to rewrite another algo which has a very different design. Being paid for it, people tend to write miles of code ("the bigger, the better"), but here we write just for fun, so I just dirty-hacked the algo into the testbed. It works properly there, though its alignment fixing needs to be rewritten completely; it shows the advantage of a 64-byte fetching inner loop with more or less long strings (still fitting into the cache), and it shows the disadvantage of such a big and complex algo when the run happens in real world conditions - the code is huge, the time to develop such code is high (maybe that is the reason? the bigger the code, the longer you design and develop it - the time goes by and the salary is paid), but the real gain is nothing more than what the hardware can already provide to a much simpler algo.
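To make the pminub idea concrete - this is not the Silvermont code itself, just a sketch of the collection trick, assuming xmm7 is zeroed, edx is 64-byte aligned and next_block is an illustrative label. The unsigned byte minimum of several blocks is zero exactly where any of the blocks has a zero byte, so one compare and one pmovmskb cover 64 bytes:

    movaps   xmm0, [edx]
    movaps   xmm1, [edx+16]
    movaps   xmm2, [edx+32]
    movaps   xmm3, [edx+48]
    pminub   xmm0, xmm1        ; per-byte minimum of blocks 0 and 1
    pminub   xmm2, xmm3        ; per-byte minimum of blocks 2 and 3
    pminub   xmm0, xmm2        ; minimum over all 64 bytes
    pcmpeqb  xmm0, xmm7        ; FFh wherever some block has a zero byte
    pmovmskb eax, xmm0
    test     eax, eax
    jz       next_block        ; no terminator in these 64 bytes - keep going
    ; otherwise re-scan the four blocks to locate which one holds the zero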

Antariy

nidud,

Quote from: nidud on March 25, 2015, 11:24:30 AM
Quote from: Antariy on March 25, 2015, 08:24:01 AM
You are reffering to an experience - but your experience being even unlimited will not disprove some simple, basical things, like the memory bus speed is, like the caching subsystem is..
Quote
What idea does this research prove.

We all know and like to write the algos and to test them on the lab. The problem is in the method which is used in that. Currently we "test" the algos against short (relatively to the 2nd level data cache, read below for explanation) pieces of data, and the test runs many times on the same short data. Being executed in these conditions, some types of algos do not show the real world performance of their.

After first (at least after couple more) interations of the test - call to the code tested and return from it - the short data is already placed in the cache, so the next iterations of the test are going in an unfair (read below why) conditions for the real world performance of the code. So, just first iteration runs with the real world performance, the next, many thousands of iterations run in not real world but "static" conditions, so in final average results the timing of the first run has no influence to show the real performance.

In the testbed the functions are read into a virtual protected area of the code segment, and they are all called in a validation test before the timing begins. This means they all start with all the data already in the cache.

Again, you missed the point, or just mixed up what was said there with what you took from it. Maybe that's the language barrier (especially since we are both non-native English speakers, I guess).

It is not said there that some code has an advantage over other code, nor that some code runs with the data already in the cache. You probably think that I was trying to disprove that your 32-byte fetching algo is faster than a 16-byte fetching one? But that was not the point. I don't care if someone writes faster code, nor do I even care about Intel's strlen, because I just know that in real world conditions it is an exaggerated (for no reason) bunch of bytes whose value is exaggerated. The "brand name" "Intel" in front of the algo doesn't mean I should respect the algo just because it has that brand name on it. After all, Intel is known for playing unfair games with AMD with their "optimising" compiler runtime (quasi-optimising - it favours Intel CPUs and penalises AMD; refer to Agner Fog's research on that). So the software they produce does not carry the same "mark of quality" as their hardware (someone might dispute that as well - there are haters of AMD as well as of Intel - but that's beside the point here). They produce hardware first of all, and then they must make users think that the hardware is great. How do they do that? They write "impressive looking" code which takes the best from the latest hardware developments but hides much of the detail behind the scenes. Everywhere you will hear that Intel's Silvermont StrLen is the fastest possible - just the cries of the fans - but none of them say that the code is just a bloated piece of software which in the real world will execute the same as (if not slower than) a much shorter and simpler algo (yours, for instance). This is "popular culture" inside IT, as in any other field.
So the point of the Research is not to disprove the "speed" (the term is relative) of some algos, or of any algos at all. You may be sure that I see your algo is "faster" (it is faster within the "static world", the same way Silvermont is faster than your algo in that same "static world"; but in the real world your algo performs the same as Silvermont, and my simple algo performs the same as Silvermont, too), and I agree that it is faster under those ideal conditions. I just showed with the research that those ideal conditions will not happen in the real world. There are so many discussions about those algos on the forum that this is just an attempt to make some notes on that, with no intention to offend anyone in their will to optimize code.

In the quote above you agree that they all start "with all the data already in the cache"? Those are your own words, so you agree with them? But that is exactly what this research does prove, and that is the problem: with the data in the cache, the timings actually fool you - the CPU plays unfair games with the tester of the code, because the CPU core to cache interconnection is not the same as the cache to RAM interconnection. You see the algo performing in ideal conditions which will never happen to it (talking about algos which fall under point 3) of the second main post of the thread - and that was underlined many times) when it is executed on a real world machine under a multitasking OS, and in user mode on top of that.

Only if your code exclusively owns the CPU core, with no other code running before and after it, does the cache contain the data you used last, up to the size of the cache. But the cache is a "circular" buffer of sorts - it is limited, and new data accesses flush the "older" data from it. Remembering that the cache has a "line" (a kind of "page") architecture, an access to even a single byte of memory which is not currently cached leads to a refill of 32/64 bytes of data from that memory location, starting at the address equal to address_of_the_byte_accessed AND 0FFFFFFC0h (for 64 byte lines). Under a multitasking OS, which simultaneously runs tens of threads, any of those threads also accesses memory which is FOR SURE not from your process's memory location in RAM - every process has its own memory mapping into RAM - so, effectively, the cache size in use for your process is, roughly speaking, the cache size divided by the number of processes. Every process has its own, non-overlapping, raw addresses of data in RAM, so the cache effectively splits its volume trying to cache all those different addresses accessed by a bunch of threads of a bunch of processes.

So, in the real world, data is flushed from the cache very quickly after it was accessed, because there is no reason to keep it there - nobody accesses the same string constantly and continuously all the time just to keep it in the cache. But our testbeds do exactly that. So they are absolutely different from the conditions in which the code will run on a real world machine, which, as already explained, flushes the data from the cache very quickly after it was accessed. So, unless your application provides a separate thread whose only task is to constantly touch the bytes of the string to keep it in the cache, you may be sure that the string your strlen code receives is not in the cache in almost all cases on a real world computer. "Almost all" - practically all; the only exception is when, for instance, you compute the string length and then, right after that, with no big delay, do something else with the string, for example scan it. But then, if you scan the string, why did you call strlen? You can do the strlen/zero terminator check while scanning the string itself, like the strchr function does. There is actually no scenario you can imagine which guarantees that your string is in the cache and stays there all the time - the probability of that is astronomically small.

So, in the real world the algo (every algo) gets a pointer to a string which is not in the cache, and when scanning it the code will have to deal with the memory bus speed. Moreover, the string data may not even be in RAM - it may be in the paging file, and that will slow things down to an incredible extent.

That is one of the most important reasons why, when you run the test only once over the tested piece, the timings are crazily high. And that's why we run the test millions of times - to remove the influence of paging and memory bus speed. But those conditions are ideal, so the code is tested in synthetic conditions, and even if "on paper" it is 10 times faster than another algo, in the real world you will stumble over the fact that it runs at the same speed as the more modest and short code. At the point where every code hits the real world limitations, there is no reason to optimize/unroll further. And being real world code, the code will also stumble over the not-as-fast-as-light decoding of the code into micro-ops, so the longer code will have longer stalls even before it executes. Remember that the code trace cache is much smaller than the 2nd level data cache, so the code micro-ops are flushed from it practically the moment execution leaves that part of the code. Multitasking makes things even harder - there are too many threads, and your code only has that small code cache within the time frame of the one OS quantum the thread is given to run. If some large piece of code crosses that time frame, then on the thread's next turn it will be reloaded and decoded again, so situations are possible where a very advanced piece of code returns results slower than simple and slow GPR code :lol: Those are very rare situations, mentioned just for fun. The code caching problems are not as important (almost not at all) for the tests as the data caching issues and the synthetic test results are.

Antariy



Quote from: nidud on March 25, 2015, 11:24:30 AM
Quote
OK, well, being reffering to your experience just couple of weeks ago you was fully sure that there is no possibility to write a safe algo and that the "deal" of a speed versus robust is acceptable.

Don't remember saying it was impossible or acceptable, but I assumed there was no danger in the environment where the functions was used.

You said that the only safe algo would be a byte-by-byte one - how else can those words be read? They may as well be spelled "all non-byte-by-byte algos are unsafe".

You "said" about acceptability of that in the other manner - you provided the timings where showed that the unsafe version of 32 byte fetching StrLen produces faster results, much faster. And you noted that even algo is unsafe, the speed matters, so that way "deal" in other words, you accepted its unsafety in deal to its speed. Yes, maybe you did not go to use it in the safe environment software, that was not the point, the point was that your experience changed just some days ago - experience and opinion about possibility to implement safe and simultaneously fast algos. So, the reffering to the experience exclusively is not an argument - that's additional point, but not the argument.

Quote from: nidud on March 25, 2015, 11:24:30 AM
Quote
"Experience" is just a word, and it will not disprove basical things like caching/memory bus speed limit/OS paging. We all here may refer to our exclusive and limited (every people has limited experience) experience, but that isn't an argument in a disput, that is just a "side note" like "from your experience something was so".

In this case the "experience" is based on testing with some good results.

For sure the results are good - where did you see me say that your safe 32 byte fetching algo is not good?

The research only says that it is complete nonsense to develop toooooo entangled and toooo bloated/sophisticated algos, because all of them finally hit the same speed as the others. The 68 byte version above of the single-XMM-reg StrLen will perform the same in the real world as the 800+ byte Silvermont - that's the point. Which code is better is just a matter of taste and everyone's own opinion. In the same way, your code is not slower in the real world than Silvermont - but also not faster than the single-reg one. This is astonishing but true, and it was shown by the research. The differences are just fractions of a percent for each code, and they actually show not how good the algos are in ALU terms, but how well the memory bus subsystem performs in relation to the cache, with the access method each code uses.

Quote from: nidud on March 25, 2015, 11:24:30 AM
Quote
Quote from: nidud on March 25, 2015, 07:23:53 AM
Quote from: Antariy on March 25, 2015, 06:13:10 AM
You missed the main point. You talk about "million times called" functions, but where did you see the application which does this with the same string consequently?

All the time where speed matters  :P

Sarcasm and the "tongue" smiley are not an argument. You are unable to disprove what was said here, and you never will be able to - take it or not, everyone has the right to be a fanatic.

:biggrin:

Sorry, didn't see the tongue there, but one tend to use the same memory in these loops.

You mean the same cache? Yes, all the code on the system uses the same cache for code/data, and that's the point. Your code should always assume that the data is not in the cache when it accesses it for the first time, and that it stays there no longer than some microseconds (to say it clearly: just for one walk through the data bytes). The cache size is limited, so it behaves like a circular buffer - the oldest data gets overwritten by the newest - so if the code doesn't keep accessing the same data all the time, the data will quickly be flushed from the cache. So the data is "always" in memory - until the code accesses it and walks through it; after that the data is in the cache, if it fits there. If it doesn't fit, then only some piece of it is in the cache, and under a multitasking OS even that small piece will quickly be flushed - the cache is a very volatile storage; if it kept data around for no reason, it would slow the system down crazily. It holds the most frequently and/or most recently accessed data. And, as was said: it is unbelievable that there is an app which does strlen on the same string all the time just to keep it in the cache. So strlen algos may freely assume the string passed to them is not in the cache - it is in RAM (or even in the paging file) - and that is the limiter of the algo's speed.

In the real world probably even a single-reg MMX strlen will perform the same as the Silvermont strlen. SSE and MMX are designed to do more or less complex arithmetic on the data - that is where the speedup is, not in the "wider" data access itself, because in purely sequential-access algos (without much arithmetic in them) the memory becomes the limiter. In number crunching algos, though, SSE/MMX speed things up by a much greater factor than GPR code can, because there the limiter is not the memory but the smaller amount of data the GPR ALU can process at once. Once the data is in the regs/cache, SSE code can do many complex computations on the packed data much faster than GPR code. At the same time, crazily unrolled SSE code like the Silvermont one will probably perform no more than 2-2.5 times faster than the regular GPR szLen from the MASM32 lib. Even though Silvermont "does processing" in 64 byte chunks with 16 byte regs, while szLen does it byte by byte, it will be faster not 16 times, as it might look, but only about 2 times in the real world. And maybe even slower sometimes, because it is much bigger than the simple GPR szLen.


Quote from: nidud on March 25, 2015, 11:24:30 AM
Quote
Try to find the example of algo which will call the strlen millions of times for the same string - consecuenly. Why the word is underlined? Because of you call it from time to time - it will be just in some milliseconds from the previous call flushed from the cache, so, the next call to the code will again go the slow - memory bus speed - way.

And there I disagree. When you tokenize a text file you tend to reuse the same memory. The same with reading files where you use the same file block which will end up in the cache before you call strlen to get the length of the file. There could be millions of files down the subdirectories, and you may open all of them and do a byte scan and so on.

When you read a file, it ends up in RAM (at best - or in the paging file), not in the cache - DMA exists precisely to free the CPU from spending its resources on peripheral transfers.

After you have read the file, you pass it to some function which processes the data. But, being in the virtual memory of the process, the data may not even be in RAM but in the paging file, so the first access to it after it was flushed from RAM to the paging file will be very slow. OK, after some thousands of cycles the page (only one page, not even the entire file's data) with the data is in RAM. Now the CPU has access to it - at the speed the RAM bus provides. The code computes the string length - the string is in the cache after that. But what will you do with that string next? Remember, you have only some microseconds of continued work with the string to keep it in the cache - otherwise it will be flushed. So you have the string length, and for some short time you have the string itself in the cache - but that is not important to you, and moreover it makes no sense, because you will next open another file, process it, and so on, and every time you will deal with the RAM speed (at best), not the internal CPU speed. The string you accessed previously will have been flushed from the cache long before you read the next string from the file, so the next time you access that string somewhere in the program you will again deal with the RAM speed.

So the cache does not help there, and to make tests which are close to the real world you should be able to "turn the cache off". The simplest way to do this is to process data whose size is much bigger than the cache size - then, even with multiple loops over the data, you get timings more or less close to the real ones. "More or less", because even then you get better results than in the real world: after the first runs of the code over the data, the data will have been committed by the system into RAM (i.e. it will for sure not be in the paging file), so finally you get results closer to the RAM speed limit than to the real world values (which will be even slower).

A refinement of the testing suggestion described in the main post of the thread.
We can lock the data in memory, so the code which accesses it deals only with the RAM speed, not the paging file speed. But there are limits on how much you can lock in memory. Before locking, one should call GetProcessWorkingSetSize, add the size of the buffer you want to lock to the values it returns, then call SetProcessWorkingSetSize with the computed values. Only after that may one call VirtualLock to lock the buffer/data in memory. If it succeeds, the data/buffer is locked in RAM - and that part of RAM is not usable by other processes, which is why this feature is limited for security reasons (if some processes locked too much memory, the system might run low on memory and crash, or become as unusable as if it had to run on a very small amount of RAM). So, practically, we probably cannot assume that we will be able to lock 300...500 MB of data in RAM, so we will have to deal with paging slowdowns in our tests, too. Testing, let's say, 300 MB of data on a system which can provide that amount of memory (i.e. 256 and even 512 MB systems are not suitable for reliable timings), with 10 loops over the code, we will get results more or less close to the RAM speed (as you remember, in the test in the research 67% of the RAM speed was achieved). And this will be more or less close to real world testing for the algos which fall under point 3) of the research conclusion.
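A minimal MASM-style sketch of that call sequence, with all error handling omitted; pBuffer and BUFSIZE are just placeholder names for the test buffer, and the usual windows.inc/kernel32 includes are assumed:

.data?
hProc    dd ?
dwMinWS  dd ?
dwMaxWS  dd ?
.code
    invoke GetCurrentProcess
    mov    hProc, eax
    invoke GetProcessWorkingSetSize, hProc, addr dwMinWS, addr dwMaxWS
    add    dwMinWS, BUFSIZE        ; grow both working set limits by the buffer size
    add    dwMaxWS, BUFSIZE
    invoke SetProcessWorkingSetSize, hProc, dwMinWS, dwMaxWS
    invoke VirtualLock, pBuffer, BUFSIZE   ; only succeeds within the enlarged working set

If VirtualLock returns zero, the buffer is simply left pageable and the test falls back to the "much bigger than the cache" approach described above.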

Quote from: nidud on March 25, 2015, 11:24:30 AM
Quote
You may take that as a rants,

Maybe a bad choice of a word, but there was a lot of text there  :biggrin:

A descriptive explanation is not the same as a "rant". I was not insisting that anyone read it - it was written to be useful to people, and people may read it or not, that is their right - but, not having read it, it is a bit strange to argue with it and accuse the writer of ranting.

Quote from: nidud on March 25, 2015, 11:24:30 AM
Quote
Well, but this is probably rant, too, so if anyone has right to stick with its opinion - the disput is useless.

No, that was interesting, and I disagree, dispute is not useless.

Not useless, but not very simple either, especially for a non-native English writer.


Antariy

Quote from: rrr314159 on March 25, 2015, 11:50:35 AM
AMD, 2 mb L2 cache

Intel, 1 mb L2, 6 mb L3

(I think)

This testing is a PITA I expect at least a 1500-word post from you after all this trouble :P

(From other thread, Antariy advice:) On AMD use ax with bsf :redface: I should have thought of that

No, the testing is good. Where do you see the PITA? And who is he? :biggrin: BTW, what does this word mean?

From your results there is an obvious change in the timings at the 1 to 2 MB crossing on the AMD, and likewise between 100 KB and 1 MB on the Intel, so the test works well.

Antariy

And now all of you are simply obliged to read all of this :greensml:
Well, this is good training in English writing, but I see that the more tired you get, the worse your English becomes ::)

dedndave


Antariy

Quote from: dedndave on March 26, 2015, 05:48:46 AM
PITA = pain in the ass   :P

Thank you for the explanation, Dave :P :t

nidud

#39
deleted

hutch--

Alex,

One comment on your English practice, use more paragraphs as it breaks up big blocks of text and makes it a lot easier to read.

RE: hardware differences accompanied with big smiles and wise nodding, welcome to the world of massive variation ranging from an Intel Atom to i7 to AMD and all of the old timers before them. Some give up in despair, others try for the magic bullet but most eventually settle for an algo that works on most things OK, has good averages and have an eye on the future hardware instructions.  :biggrin:

Antariy

Quote from: hutch-- on March 26, 2015, 09:25:25 AM
Alex,

One comment on your English practice, use more paragraphs as it breaks up big blocks of text and makes it a lot easier to read.

RE: hardware differences accompanied with big smiles and wise nodding, welcome to the world of massive variation ranging from an Intel Atom to i7 to AMD and all of the old timers before them. Some give up in despair, others try for the magic bullet but most eventually settle for an algo that works on most things OK, has good averages and have an eye on the future hardware instructions.  :biggrin:

Thank you, Hutch :biggrin:

Antariy

Quote from: nidud on March 26, 2015, 07:57:05 AM
Quote
The research say only about full nonsense in developing toooooo entangled and toooo bloated/sophisticated algos because all of them will finally hit to the same speed as the other ones.

If you increase the byte size to a magic number that's seems to be the case. The memcpy test suggest the number is between 1000 to 2000 bytes before the "smart-ass technology" kicks in.

If you carefully read those big posts, you will see that these "arguments" are not arguments at all. Regardless of any "magic numbers", there will not be any speedup in real world execution, where you do NOT execute the algo thousands of times over the same data, constantly. If you do not want to read/absorb those big posts, then you should at least agree to disagree with them, but not try to "disprove" them without trying to understand them - otherwise the discussion is just demagogical.

nidud

#43
deleted

Antariy

Quote from: nidud on March 27, 2015, 09:19:55 AM
Quote from: Antariy on March 27, 2015, 06:28:27 AM
If you increase the byte size to a magic number that's seems to be the case. The memcpy test suggest the number is between 1000 to 2000 bytes before the "smart-ass technology" kicks in.

This has to do with hardware where new technology do things differently. You may read more about the magic number here. In this test you will see how the SIMD functions are eaten by a simple REP MOVSB at a spesific (magic) byte-size.

I know about the rep movsd speed advantage in specific circumstances, but that comparison is irrelevant for this research - here we are talking about different things, and those things are a "level deeper" than the movsd/movdqa comparison. Actually, you may think of this thread as explaining why movsd is faster when the byte count is high enough: if you read this thread, you saw the point which claims that in real world conditions the data used by algos which fall under point 3) of the conclusion of this research is practically never in the cache. So "shortcuts" in the CPU circuitry which do a non-temporal direct move, not messing up the cache, are the reason why movsd is faster with higher byte counts. movsd/stosd with high byte counts works bypassing the cache, so you may notice that it works in exactly the conditions this research forced the algos into, so that they show the drawbacks of an overcomplicated design.

Quote from: nidud on March 27, 2015, 09:19:55 AM
Quote
If you carefully read those big postes

:P

Carefully read - and I forgot to add: and try to understand them before trying to disprove them.

Quote from: nidud on March 27, 2015, 09:19:55 AM
Quote
Well, with string sizes smaller than 16 the "fastest" algo is the first one, one-XMM reg using. If the string is 16+ with the terminator it will be 17+, so it will not fit into the one XMM reg so it will require additional execution and iteration, so after than the algo becomes "slower".

The second, complex algo, has higher overhead of complex entry than the third (nidud's) algo, so that's why it's little bit slower on short strings.

This I agree with Alex. It's logic not magic. And here comes the interesting part:
Quote
Second algo shows the "cost" of the complex entry + huge data fetch code which cannot "pay its cost back" with too short strings.

So here is (in my humble opinion) where you have to do some research:
1. What type of application needs a fast string library ?
2. How is the code/data organized in this application ?
3. What is the average byte-size passed to this function ?

I do not have to do another research just because you did not want to agree with this one. This research says what it set out to say, and it is useful for those who want to get it rather than to disprove it through a lack of understanding of it. This discussion is something like the "Russell's teapot" question: you want to claim that there is a teapot in orbit between Mars and Earth, but you want me to prove or disprove that with another research - a strange thing. You may take this as "you're right and I just don't want to do the research because I'm afraid it would show the contrary of what I said in the previous research" - i.e., if you prefer to think you are right here, think so - but I am too lazy to do research asked for by people who will not listen to the results anyway if they are not what they expected. If you are not lazy, well, find a reasonably decent real world app and substitute the strlen algo it uses, in 3 steps with 3 different algos: the single-XMM-reg one, your 32 byte fetching one, and then Intel's 64 byte fetching one. Each time benchmark the application and then try to find at least a 5% difference in its performance. Those who agreed with/learned something from this research in this thread agreed with it; those who do not agree, do not agree (but you should note that those who agree are the larger group and did not even try to disprove anything, because there is nothing to dispute or disprove).

Quote from: nidud on March 27, 2015, 09:19:55 AM
Quote
As you see, the algos almost go "head to head". Starting from the 1 MB of the data and until the 64 megs.

Depending on the answer to point 3 this may be irrelevant for the test, but still useful information.

Quote
1) In real world the data passed to the algos are almost always not in the cache, especially if that's data which doesn't get touched very frequently (many times per second).

This I disagree with as mention above but it depends on the answer to point 2.

Again, if you read the thread carefully, you will notice that there is a very descriptive and simple explanation of why you may assume that the algos falling under point 3) work with data that is not in the cache when it is passed to them in a call. If you do not read it, or still disagree, then I do not have to search for "proof" of why there is no teapot flying around the sun. Demagogic discussions are always built on long explanations from one side and "trolling" from the other side, which does not agree but makes short, unargued statements - and there is no reason to keep explaining things that were explained many times in the thread. If you do not agree with those 3 points in the conclusion of the research, then there is nothing to add to the explanations that were repeated many times. No real world testing will disprove any claim made in this research, and the synthetic ones do not count. Those who disagree with this should continue to disagree, or try to disprove some claim with arguments; otherwise the author of the research does not have to continue a discussion about a research that was explained very simply and descriptively and cannot be disproved in principle. Those who disagree have the right to disagree silently; otherwise such careless attention to the explanations looks more like trolling.