If you read this thread, you saw the point claiming that, in real-world conditions, the data used by algos that fall under point 3) of this research's conclusion is almost always not in the cache.
3) The algos which make no more than one pass over all the data passed to them, taking into account what was said in the two points above,
1) In the real world the data passed to the algos is almost always not in the cache, especially if it is data which doesn't get touched very frequently (many times per second).
2) The data may not even be in RAM but in the paging file, which makes access to it even slower.
This I have already answered, so it’s difficult to understand what you really mean here. Given that there is no physical connection between the CPU and memory, all data will be loaded into the cache. Using large buffers in the hope of flushing it may not work out as you think.
You gave the answer, yes, but it's still technically incorrect. You do not understand how the CPU-cache and cache-memory subsystems work, and you look at all of it as a "black magic box". "All data" will never be loaded into the cache; otherwise there would be no reason to use RAM, since the cache would have to be big enough to hold "all data" of all programs (including the OS itself). The cache is there to speed up access to very frequently accessed data, and nothing more than that. If, after the entire thread and its discussion, you still don't see what is wrong in your assumptions and beliefs, how can it be proved to you? If you prefer to decide that the tests used on the forum are the right thing, and to optimize your code based on those timings - well, it is your right to do your optimization and testing as you want. But remember, you will never be able to disprove what is said in this research - nobody will be able to, and this is not pride speaking. If you don't believe it, disprove it, or find someone who will be able to, but you will search for a very long time and in the end find no one. So it's better to spend that time optimizing your code in the ways you think are right. After all, optimizing your code based on the testbeds used on the forum will give you a vague but more or less right direction, and the results you get from it will be good. What this research said is that you may get the same good final results without overly fanatic optimization attempts.
Since you can’t bypass the cache, my argument was to play along with it. If you are paranoid about the same string being in the cache on repeated calls, you may try to flip pointers:
A db "string A",0
  db 64 dup(0)      ; padding between the strings: try 32, 64 or 128 bytes
B db "string B",0
However, if you look into your own test you will see that even the test macro touches memory during each iteration, so my guess is that the mantra of "million calls with the same…" may not be that much of a problem.
Nevertheless, I was not going to make any arguments about how the cache works; my argument was with your claim that the tests done in the forum use "million calls with the same…", which is simply not true.
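For readers who prefer C to data directives, the pointer flip suggested above might be sketched roughly as follows. The names flip_pair and flip_test and the 64-byte pad are made up for this illustration; nothing here is taken from the forum testbed itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two short strings separated by padding, mirroring the A/B layout above.
   The 64-byte pad is an assumption (32 or 128 could be tried as well). */
struct flip_pair {
    char a[16];
    char pad[64];
    char b[16];
};

/* Calls strlen `calls` times, alternating between the two strings on
   every call. Returns the sum of all lengths so the result is used. */
size_t flip_test(const struct flip_pair *p, size_t calls)
{
    assert(p != NULL);
    const char *cur = p->a;
    const char *other = p->b;
    size_t total = 0;
    for (size_t i = 0; i < calls; i++) {
        total += strlen(cur);
        /* flip the pointers so the next call sees the other string */
        const char *tmp = cur;
        cur = other;
        other = tmp;
    }
    return total;
}
```

Timing is left out on purpose: wrap the call to flip_test in whatever timer the testbed already uses. The whole struct is 96 bytes.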
This is not me being paranoid - this is you being stubborn and thus acting ignorantly. Take the example of swapping two pointers to very short strings, which will obviously still fit into the cache: executed, well, ok, not a million, but half a million times over each string, it runs the test in conditions which the code will never meet in the real world.
I don't understand what you meant by "test macro". Still, if you look at the tests not too carelessly, you will see that the second set of tests, which showed similar timings for every algo tested, does only 10 iterations over the same string, whose size is bigger than the cache. And you may decrease that number even further. Yet you continue to use citations and phrases without understanding what was said there. Do you really think this is a valid way to argue that the claims made in the research are not true? That's ignorance.
Well, you were not going to make any arguments (doesn't this sound funny in a discussion which is based only on arguments? if not on arguments, then on what basis will you build your dispute?). Well, ok, you make no arguments, but you reject the claim of the research just because you "don't agree". Don't you see that this is the behaviour of fanatic belief in something? You will not make arguments, but you'll believe in what you think, and that's all for you.
If you don't understand what is said here - be it from ignorance, inattention, or just lack of understanding - then it is even more ignorant to try to disprove it, or to make your own claims about the "trueness" or not of the research and of the citation about "millions of calls". If you don't understand what the research is based on, and if you don't believe it, then it is your right to do your work as you want for yourself. But it looks like it is you who keeps repeating citations like mantras without understanding them and the underlying things on which they are based, with a paranoid desire to (dis)prove something about points you did not understand. If you think that your example with two "swapped" pointers gets any closer to the real world - well, continue to believe that. But a test with millions of different (even short) strings passed to the algo consecutively is the only test which will show the real performance of your algo - that's what was said in the research, that's what was said by Hutch, too, and that's what was implemented by Jochen in code using 200 MB of strings to test strlen, and it showed the claims here to be true. But, still, these are not arguments which you, in your fanatic beliefs, may take notice of.
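To make the "many different strings, each passed once" idea concrete, here is a minimal C sketch. It is not Jochen's code; fill_strings and single_pass are hypothetical helper names, and the buffer size is whatever the caller allocates (it would have to be far bigger than the last-level cache, e.g. the 200 MB mentioned above, for the data to be cold).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fills buf (size bytes) with back-to-back zero-terminated strings of
   len characters each. Returns the number of strings written. */
size_t fill_strings(char *buf, size_t size, size_t len)
{
    assert(buf != NULL);
    size_t count = size / (len + 1);
    for (size_t i = 0; i < count; i++) {
        char *s = buf + i * (len + 1);
        memset(s, 'a' + (int)(i % 26), len);  /* vary the content a little */
        s[len] = '\0';
    }
    return count;
}

/* Passes every string in the buffer to strlen exactly once, front to
   back. Returns the sum of the lengths so the result is used. */
size_t single_pass(const char *buf, size_t count)
{
    assert(buf != NULL);
    size_t total = 0;
    const char *s = buf;
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(s);
        total += n;
        s += n + 1;  /* step over the terminator to the next string */
    }
    return total;
}
```

Only the call to single_pass should be timed, and strlen can be replaced by the algo under test. Note that a front-to-back walk still lets the hardware prefetcher help; visiting the strings in a shuffled order would be a harsher variant of the same test.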
So, "shortcuts" in the CPU circuitry which do non-temporal direct moves without messing up the cache are the reason why movsd is faster with higher byte counts. movsd/stosd with high byte counts works bypassing the cache, so you may notice that it works in the very conditions which this research forced the algos into, where they show the drawbacks of their overcomplicated design.
When the conditions which "the research forced the algos into" exceed the magic number, you only measure the technology impact of the inner loop.
This is, as I was saying, interesting, but I'm not sure how relevant it is to testing in general. The research is fine, but I disagree with some of your conclusions, especially the part about the tests done here in this forum being void as a result of it.
Those "magic numbers" mean nothing real with small buffers - there you just measure the speed of the CPU-cache interconnect. They are real only with larger strings, and they are not something "black boxed" but just shortcuts in the CPU design which kick in only after some design-specific byte count. And take notice - only after a big byte count, and the bigger the byte count, the more performant the MOVSD code is and the bigger the difference between SSE and MOVSD code. Still, using techniques which avoid trashing the cache with moves, you may get the same performance with SSE code. Though, again, real-world code then gains nothing from using complex and long SSE code just to get, after all, the same performance as rep movsd provides. SSE is for number crunching - there it may benefit from the cache on small data sets, and even without that help, on big data sets, SSE still gains from its wider ALU arithmetic. But you may notice that for StrLen the SSE code is only 2 times faster than regular non-scasb GPR code in real-world conditions. And even if it is 10 times more complicated and quasi-sophisticated, the code will still not be any faster than that in the real world. But in the testbeds used on the forum the code will be even 10 times faster, and the code writer will think that the development time he put into it was not wasted, even though in the real world the code will be no more than 2 times faster than GPR code. This is what you believe in - the idol of the forum testbeds' "millions of loops over the same short data". This idol is false: in the real world, in any real app, the 68-byte code posted a page earlier will perform at the same speed as Intel's 800+ byte code. You may find this "interesting", but you still continue to believe in your idol of testbed timings.
Well, there you may see the "magic promises" of your belief, but in a real application you will see a speed-up limited by the things you do not want to know about (and instead want to "disprove").
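As an illustration of the "techniques which avoid cache trashing with moves" mentioned above, one common approach is to write the destination with SSE2 non-temporal stores. The sketch below is only an assumption about what such code could look like, not the code discussed in this thread; copy_nt is a made-up name, and it falls back to plain memcpy when SSE2 is unavailable or the destination is not 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copies n bytes from src to dst (the regions must not overlap).
   With SSE2 and a 16-byte-aligned dst, whole 16-byte blocks are written
   with non-temporal stores; the tail and all other cases use memcpy. */
void copy_nt(void *dst, const void *src, size_t n)
{
    assert((dst != NULL && src != NULL) || n == 0);
#if defined(__SSE2__)
    if (((uintptr_t)dst & 15) == 0) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t blocks = n / 16;
        for (size_t i = 0; i < blocks; i++) {
            __m128i v = _mm_loadu_si128(s + i);  /* src may be unaligned */
            _mm_stream_si128(d + i, v);          /* store with a non-temporal hint */
        }
        _mm_sfence();  /* order the streamed stores before later stores */
        size_t done = blocks * 16;
        memcpy((char *)dst + done, (const char *)src + done, n - done);
        return;
    }
#endif
    memcpy(dst, src, n);
}
```

Whether this matches rep movsd on a given CPU is something only a measurement on that CPU with large, cold buffers can tell; the sketch shows the mechanism and makes no claim about the timings.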