Research on the algos design

Started by Antariy, March 24, 2015, 06:42:55 AM


nidud

#45
deleted

Antariy

Quote from: nidud on March 29, 2015, 06:58:20 AM
Quote from: Antariy on March 29, 2015, 04:59:28 AM
If you read this thread, you saw that it states the point which claims that, under real-world conditions, the data used by the algos falling under point 3) of this research's conclusion is, as a rule, not in the cache.
Quote
3) The algos which do not do more than one iteration over all the data passed to them, taking into account what is said in the two points above,
Quote
1) In the real world, the data passed to the algos is almost never in the cache, especially data that is not touched very frequently (many times per second).
2) The data may not even be in RAM but in the paging file, which makes access to the data even slower.

This I have already answered, so it's difficult to understand what you really mean here. Given that there is no direct physical connection between the CPU and memory, all data will be loaded into the cache. Using large buffers in the hope of flushing it may not work out as you think.

You gave an answer, yes, but it is still technically incorrect. You do not understand how the CPU cache and the cache-memory subsystem work, and you look at all of it as a "black magic box". "All data" will never be loaded into the cache; otherwise there would be no reason to use RAM at all, if the cache were big enough to contain the entire "all data" of all programs (including the OS itself). The cache is there to speed up access to very frequently accessed data, but no more than that.

If, after this entire thread and its discussion, you still don't see what is wrong in your assumptions and beliefs, how can it be proven to you? If you prefer to decide that the tests used on the forum are the right thing, and to optimize your code based on those timings, well, that is your right: do your optimization and testing as you want. But remember, you will never be able to disprove the things said in this research; nobody will be able to disprove them. This is not pride speaking: if you don't believe it, disprove it, or find someone who is able to disprove it, but you will search for a very long time and in the end find no such person. So it's better to spend that time optimizing your code in the ways you think are right.

After all, optimizing your code based on the testbeds used on the forum will give you a vague but more or less correct direction, and the results you get from that will be good. What this research says is that you may achieve the same good final results without too-fanatical optimization attempts.


Quote from: nidud on March 29, 2015, 06:58:20 AM
Since you can't bypass the cache, my argument was to play along with it. If you are paranoid about the same string being in the cache on repeated calls, you may try to flip pointers:
A db "string A",0
db 32/64/128 dup(0)
B db "string B",0


However, if you look into your own test you will see that even the test macro touches memory during each iteration, so my guess is that the mantra of "million calls with the same..." may not be that much of a problem.

Nevertheless, I was not going to make any arguments about how the cache works; my argument was against your claim that the tests done on the forum use "a million calls with the same...", which is simply not true.

This is not me being paranoid; this is you being stubborn and thus acting ignorantly. With the example of swapping two pointers to very short strings, which will obviously still fit into the cache, executed, well, OK, not a million, but half a million times over each string, you run the test under conditions which the code will never encounter in the real world.

I don't understand what you meant about the "test macro". Still, if you look at the tests less carelessly, you will see that the second test, which showed similar timings for every algo tested, has only 10 iterations over the same string, with a size bigger than the cache, and you may decrease that number even further. Yet you continue to use citations and phrases without understanding what was said there; do you really think that is a valid way to argue, or to claim that what the research says is not true? That's ignorance.

Well, you were not going to make any arguments (doesn't that sound funny in a discussion, which is by nature built on arguments? If not on arguments, on what basis will you construct your dispute?). OK, you make no arguments, but you "disprove" the claim of the research just because you "don't agree". Don't you see that this is the behaviour of fanatical belief in something? You will not make arguments, but you'll believe what you think, and that's all there is for you.

If you don't understand what is said here, whether from ignorance, inattention, or simple lack of understanding, then it is all the more ignorant to try to disprove it, or to make your own claims about the "trueness" of the research and of the citation about "millions of calls". If you don't understand what the research is based on, and you don't believe it, then it is your right to do your work as you want. But it looks like it is you who keeps repeating citations like mantras, without understanding them or the underlying things they are based on, combined with a paranoid desire to (dis)prove something about points you did not understand. If you think your example with two "swapped" pointers gets any closer to the real world, well, continue to believe that. But a test with millions of different (even short) strings passed to the algo consecutively is the only test which will show the real performance of your algo. That's what was said in the research, that's what was said by Hutch, too, and that's what was implemented by Jochen in code using 200 MB of strings to test strlen over, and it showed the claims here to be true. But, still, these are not arguments which you, in your fanatical beliefs, will take notice of.

Quote from: nidud on March 29, 2015, 06:58:20 AM
Quote
So, "shortcuts" in the CPU circuit which do non-temporal direct move not messing up the cache is the reason why movsd is faster with higher byte counts. The movsd/stosd with high bytecounts is working bypassing the cache, so you may notice that it works in the conditions which this research forced the algos to work in so they show their drawbacks on too overcomplicated design.

When the conditions which "the research forced the algos into" exceed the magic number, you only measure the technology impact of the inner loop.

This is, as I was saying, interesting, but I'm not sure how relevant it is to testing in general. The research is fine, but I disagree with some of your conclusions, especially the part about the tests done here in this forum being void as a result of it.

Those "magic numbers" are nothing real with small buffers - you just measure the speed of the CPU-cache interconnection, those "magic numbers" are real only with larger strings and are not something "black boxed" but just a shortcuts in the CPU design which come to work only after some, specific for the design, byte count, and take notice - only after big bytecount, and the bigger the bytecount is, the more performant the code (MOVSD) is - the bigger the difference between SSE and MOVSD code, but, still, using some techniques which avoid cache trashing with moves, you may get the same performance with SSE code. Though, again, real world code then has no any gain from using complex and long SSE code, to get the same performance after all, as the rep movsd will provide. SSE is for number crunching - and there it might has advantages of cache help over the small data sets, and even without that help using big data sets the SSE still have gains just of more wider ALU arithmetic. But you may notice then, then for StrLen code its speed is only 2 times higher than regular non-scasb GPRs code in real world conditions. And being even 10 times overcomplicated and quasi-sophisticated, the code will still not be any faster than that in the real world. But in the testbeds used on the forum the code wil be faster even in 10 times, and the code writer will think that the development time he put into the code is not useless, even though the code in real world will not be faster more than 2 times than GPRs code. This is what you believe in - the idol of forums testbed's "millions of loops over the same short data". This idol is false, and in the real world and in any real app the code posted page earlier, being long at 68 bytes, will perform the same speed as the Intel's 800+ bytes code, you, maybe, find this "interesting", but you still continue to believe in your idol of testbed's timings. Well, there you might see the "magic promises" from your belief, but in the real application you will see the speed up which will be limited to the things you do not want to know of (instead of that you want to "disprove" them).

jj2007

Quote from: nidud on March 29, 2015, 06:58:20 AM
Given that there is no direct physical connection between the CPU and memory, all data will be loaded into the cache. Using large buffers in the hope of flushing it may not work out as you think.

Since you can't bypass the cache, my argument was to play along with it. If you are paranoid about the same string being in the cache on repeated calls, you may try to flip pointers:
A db "string A",0
  db 64 dup(0)        ; padding - pick 32, 64 or 128 bytes
B db "string B",0

I am afraid that Alex is right here. There is a reason why I am reading multiple copies of bible.txt in the fat files thread, and the reason is precisely that modern CPUs have several data caches (see Multi-level caches). Level 1, the fastest, was 8 KB in a P4; today, the Haswell architecture features three levels:
Quote
L1 cache    64 KB per core
L2 cache    256 KB per core
L3 cache    2 MB to 20 MB shared

It is obvious that even if you swap pointers, your complete data set is in L1. If instead you use a 200 MB file, you may still see some minor influence of L3 (typically 2 MB nowadays, i.e. 1% of 200 MB), but the influence of L1 and L2 becomes negligible. This is why timings differ dramatically between a) multiple accesses to a small memory area sitting in the cache and b) few accesses to a big memory area that must be almost completely shuffled through the cache before the CPU can use it.
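To make the point concrete, here is a minimal sketch (assuming the masm32rt.inc runtime is available; the 200 MB size and the 64-byte line stride are illustrative assumptions, not measurements) that touches one byte per cache line of a buffer roughly 100 times larger than a typical L3, so nearly every access is a miss that must be serviced from RAM:

include \masm32\include\masm32rt.inc
BUFSIZE equ 200*1024*1024          ; 200 MB, ~100x a typical 2 MB L3
.code
start:
    invoke GlobalAlloc, GMEM_FIXED, BUFSIZE
    mov esi, eax                   ; base of the buffer
    invoke GetTickCount
    mov ebx, eax                   ; start time in ms
    xor ecx, ecx
  @@:
    mov al, [esi+ecx]              ; touch one byte per 64-byte cache line
    add ecx, 64
    cmp ecx, BUFSIZE
    jb @B
    invoke GetTickCount
    sub eax, ebx                   ; elapsed ms for one cold pass
    print str$(eax), " ms for one pass over 200 MB", 13, 10
    invoke GlobalFree, esi
    exit
end start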

nidud

#48
deleted

nidud

#49
deleted

hutch--

I wonder at some of the comments in this enduring debate. The vast majority of relevant information is contained in data, and code is only a factor if it is big enough to break the code cache, a situation that is best avoided for maximum performance. Now, depending on the age and type of hardware, there have been different strategies to load data from main storage into the processing component of the hardware, and we know that in any of the more recent processors you have a sequence of caches that increase in speed and decrease in size as they get closer to the data-processing component of the hardware.

Now, if it's data that is specific to the algorithm (a lookup table or similar), you need to keep its size down and have it accessed so that it remains as close as possible to the processing component of the hardware. The rest is data that you are going to process. It is shifted from data storage through a succession of data caches that are progressively closer to the processing component of the hardware. This was done historically to improve the throughput of data, rather than having to keep accessing the data in main storage or in a single block of allocated memory.

Now apply that to testing algorithms and you have some guidelines as to what you need to do to create a test for different algorithms that are being applied to the same task. At its simplest, if you need to test combinations of mnemonics without regard to the data being processed, then short loops in level-1 cache can do the job OK, as long as you understand the limitations of this testing technique.
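A minimal sketch of such a level-1-cache loop test (assumptions: rdtsc is available, teststring is a small string defined in .data that fits in L1, and MyStrLen is a hypothetical stdcall algo under test; no cpuid serialization is done, so treat the counts as rough):

    rdtsc
    mov ebx, eax                 ; start TSC, low dword only
    mov esi, 100000              ; iteration count
  @@:
    invoke MyStrLen, offset teststring   ; hypothetical algo; data is hot in L1 after the first call
    dec esi
    jnz @B
    rdtsc
    sub eax, ebx                 ; ~cycles for 100000 in-cache calls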

At the other end of the scale, if you have to process very large data in a linear manner, you need to construct a test that emulates the conditions you will use the algorithm under. A gigabyte-sized data file does the job here, as it emulates the cache load and, in some cases, the write-back of data to either the same memory or an allocated block at another address. Another consideration here is the option in later SSE instructions to make non-temporal writes to avoid two-way cache pollution. There are contexts where cached reads and non-temporal writes give big performance gains if executed properly.
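A minimal sketch of that cached-read, non-temporal-write pattern (assumptions: SSE2 is available, esi and edi are 16-byte aligned, and ecx holds a nonzero byte count that is a multiple of 64):

  @@:
    movdqa  xmm0, [esi]          ; reads travel through the cache as usual
    movdqa  xmm1, [esi+16]
    movdqa  xmm2, [esi+32]
    movdqa  xmm3, [esi+48]
    movntdq [edi], xmm0          ; non-temporal stores bypass the cache
    movntdq [edi+16], xmm1
    movntdq [edi+32], xmm2
    movntdq [edi+48], xmm3
    add     esi, 64
    add     edi, 64
    sub     ecx, 64
    jnz     @B
    sfence                       ; order the NT stores before any later reads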

Between these two ends is a massive range of different tasks that differ in terms of data access and iteration rate. A reasonable amount of code is called at an incidental level rather than hammered at high frequency, and the test here is attack, not necessarily linear read time. Designing a test for this type of condition is a lot harder than it looks, because while you may be able to design an algorithm that is a superstar in the test bed, unless you match the algorithm to the task being performed it runs the risk of getting its ass kicked by another algorithm that better matches the task.

jj2007

Quote from: nidud on March 31, 2015, 05:56:36 AM
True, using the cache is faster. However, in one call (assuming a 200M long string) none of the fetched lines will be reused, but they are nevertheless pushed into the cache.

And this is the exact reason why processing already-cached data (e.g. two little strings) is different from processing 200 MB that must be pushed through the cache before it becomes available to the CPU.

nidud

#52
deleted

jj2007

Quote from: nidud on April 01, 2015, 02:04:49 AM
The pushing in this case happens after you finish processing the data you already have copied from memory.

Oh really?

include \masm32\MasmBasic\MasmBasic.inc      ; download
.data
S1      db "Hello, this is a simple string intended for testing string algos. It has 132 characters without zero, as the text file loaded below.", 0
S2      db "Hello, this is a simple string intended for testing string algos. It has 132 characters without zero, as the text file loaded below.", 0
  Init
  Recall "Bible200.txt", L$()
  mov ebx, eax      ; #of strings
  PrintCpu 0
  Print "Little exercise showing the influence of the cache:"
  REPEAT 3
      xor ecx, ecx      ;string counter
      xor edi, edi      ; len counter
      NanoTimer()
      .Repeat
            add edi, Len(L$(ecx))
            inc ecx
      .Until ecx>=ebx
      Print Str$("\n%i ms to get the length of ", NanoTimer(ms)), Str$("%i strings", ecx), Str$(" with %i total bytes", edi)
     
      xor ecx, ecx      ;string counter
      xor edi, edi      ; len counter
      NanoTimer()
      .Repeat
            add edi, Len(offset S1)
            inc ecx
      .Until ecx>=ebx
      Print Str$("\n%i ms ", NanoTimer(ms)), Str$("to get %i times the length of 1 string", ecx), Str$(" with %i total bytes", edi)
     
      xor ecx, ecx      ;string counter
      xor edi, edi      ; len counter
      NanoTimer()
      .Repeat
            add edi, Len(offset S1)
            inc ecx
            add edi, Len(offset S2)
            inc ecx
      .Until ecx>=ebx
      Print Str$("\n%i ms ", NanoTimer(ms)), Str$("to get %i times the length of 2 strings", ecx/2), Str$(" with %i total bytes\n", edi)
  ENDM
  Inkey "And now??"
  Exit
end start


Output:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Little exercise showing the influence of the cache:
234 ms to get the length of 6076600 strings with 803401800 total bytes
67 ms to get 6076600 times the length of 1 string with 802111200 total bytes
69 ms to get 3038300 times the length of 2 strings with 802111200 total bytes

236 ms to get the length of 6076600 strings with 803401800 total bytes
67 ms to get 6076600 times the length of 1 string with 802111200 total bytes
68 ms to get 3038300 times the length of 2 strings with 802111200 total bytes

237 ms to get the length of 6076600 strings with 803401800 total bytes
66 ms to get 6076600 times the length of 1 string with 802111200 total bytes
68 ms to get 3038300 times the length of 2 strings with 802111200 total bytes

nidud

#54
deleted

jj2007

Believe what you want to believe. Why am I wasting my time here? ::)

rrr314159

Quote from: jj2007 on April 01, 2015, 08:48:40 AM
Why am I wasting my time here? ::)

... when I could be wasting it somewhere else? :biggrin:
I am NaN ;)

nidud

#57
deleted

Antariy

Hey guys, why not review and revive this old thread again? Let the TIMEWASTING HOLY WAR (or shit war? haha) CONTINUE!

To Hutch: this really, really is a joke :D Don't be stressed, I'm not going to continue this thread; it was just fun to stumble over it.
Quote from: hutch-- on March 31, 2015, 10:44:02 AM
unless you match the algorithm to the task being performed it runs the risk of getting its ass kicked by another algorithm that better matches the task.

CPUs and their caches have grown far more FAT (no, not FAT12, nor FAT16, nor FAT32) and powerful in the time that has passed, but the rules of the game are still the same, huh.

hutch--

Hi Alex,

Glad there is someone left in the world with a sense of humour.  :tongue: