Research on the algos design

Started by Antariy, March 24, 2015, 06:42:55 AM


nidud

#15
deleted

Antariy

Gunther, thank you! :t Yes, if it's not too boring, test it at home, too.

Steve, thank you for the test! It seems that both of your tested CPUs have 2 MB of cache.

Quote from: nidud on March 25, 2015, 01:47:10 AM
In my experience, testing small algos and then doing a real test in the application they are used in shows consistency in the test results.

I will also add that the result from a real test is surprising. It's often difficult to see how many times these functions are called in any given sequence of code. A few small adjustments could then have a huge impact where these functions are called millions of times.

You missed the main point. You talk about functions "called millions of times", but where have you seen an application which does this with the same string consecutively?
If you want to test a function which may be called millions of times in the real world - well, OK, just do a testbed like Hutch said: create an array of pointers to short or long strings (as you prefer), but make the number of strings very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real-world performance of the algo. All other "called millions of times" types of test do not show the right result. The algo may be very fast within the testbed, but in reality it will not perform any faster than a much smaller algo, or it may even be slower. Yes, that's surprising - just as in the test here you see three algos of different design and size run at the same speed when tested under real-world conditions. You may accept that or not, but, actually, yes, our tests here and in the Memcopy thread are almost useless: we posted a bunch of functions, but in the real world all of them will perform at the same speed.
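Something like the following minimal C sketch illustrates the idea (the sizes, names and the plain strlen call are placeholders, not the actual testbed used in this thread):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define TOTAL_BYTES (256u * 1024u * 1024u)  /* far larger than any L2/L3 cache    */
    #define NSTRINGS    (4u * 1024u * 1024u)    /* many strings, ~64 bytes on average */

    int main(void)
    {
        /* one big pool holding all the strings back to back */
        char  *pool = malloc(TOTAL_BYTES);
        char **ptrs = malloc(NSTRINGS * sizeof *ptrs);
        if (!pool || !ptrs) return 1;

        /* fill the pool with zero-terminated strings of varying length */
        size_t pos = 0, n = 0;
        while (n < NSTRINGS && pos + 130 < TOTAL_BYTES) {
            size_t len = 1 + rand() % 128;       /* 1..128 byte strings */
            ptrs[n++] = pool + pos;
            memset(pool + pos, 'a', len);
            pool[pos + len] = 0;
            pos += len + 1;
        }

        /* walk ALL the strings once per pass, so the data cannot stay cached */
        clock_t t0 = clock();
        size_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += strlen(ptrs[i]);            /* replace with the algo under test */
        clock_t t1 = clock();

        printf("%zu strings, %zu bytes scanned, %.0f ms\n",
               n, total, (t1 - t0) * 1000.0 / CLOCKS_PER_SEC);
        free(ptrs);
        free(pool);
        return 0;
    }

Timed this way the working set is far bigger than the cache, so the numbers reflect memory bus speed rather than cache hits.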

And, again, each point was described in the big posts. When we are talking about those conditions and about algos which fall under point 3), it's pointless to argue - well, it's possible, but the dispute is useless, since if the well-explained posts haven't proved anything, then there is nothing to dispute. Everyone has their own opinion and has the right to stick with it; this post was not an "eye-opening reveal", just an example which may prove something to those who find it useful.

jj2007

Quote from: Antariy on March 25, 2015, 06:13:10 AM
just do a testbed like Hutch said: create an array of pointers to short or long strings (as you prefer), but make the number of strings very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real-world performance of the algo.

Note that the 'bible' testbed in reply #8 does exactly that. Here is a run with 50 bibles, 200 MB, 1519150 different pointers:
Len(): 54501 us for 1519150 lines, 200850450 bytes
len(): 120439 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 53723 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54899 us for 1519150 lines, 200850450 bytes

Len(): 53089 us for 1519150 lines, 200850450 bytes
len(): 111344 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 54654 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54684 us for 1519150 lines, 200850450 bytes

nidud

#18
deleted

Antariy

Quote from: nidud on March 25, 2015, 07:23:53 AM
My point was the connection between testing and real usage, and my experience is that the test result is not perfect but valid in this regard.

It's "valid", but not always - valid in a limited way, let's say. Yes, you may say: "if an algo tested in the usual way is faster than other algos, then almost always (the key word is "almost") it will also be faster than the other algos in the real world".
You are referring to experience - but your experience, even if it were unlimited, will not disprove some simple, basic things: the memory bus speed, the caching subsystem (which works well only with data that is accessed constantly and thus sits in the cache at least "almost" constantly), and the memory subsystem of the OS, which slows things down even more. And, speaking of experience: just a couple of weeks ago you were fully sure that there was no possibility of writing a safe algo and that the trade-off of speed versus robustness was acceptable. "Experience" is just a word, and it will not disprove basic things like caching, the memory bus speed limit, or OS paging. We may all refer to our own exclusive and limited experience (everyone's experience is limited), but that isn't an argument in a dispute, just a side note of the form "in my experience it was so".

Quote from: nidud on March 25, 2015, 07:23:53 AM
Quote from: Antariy on March 25, 2015, 06:13:10 AM
You missed the main point. You talk about functions "called millions of times", but where have you seen an application which does this with the same string consecutively?

All the time where speed matters  :P

Sarcasm and the "tongue" smiley are not an argument. You are unable to disprove what was said here, and you never will be able to disprove it - take it or not, everyone has the right to be a fanatic.

Try to find an example of code which calls strlen millions of times for the same string - consecutively. Why is that word underlined? Because if you call it only from time to time, the string will have been flushed from the cache within milliseconds of the previous call, so the next call will again go the slow way, at memory bus speed.

And, for your information, no modern and widely used library which works with strings heavily actually uses zero-terminated strings. Every more or less advanced library uses binary (counted) strings which store their size in the string object, so the length of the string is computed only once - when it's converted from a zero-terminated string to the string object used by the runtime. All other operations then require no further calls to a zero-terminated strlen. Zero-terminated strings are something of an anachronism - they are still used only for their simplicity and standardness. Any other string storage mechanism requires additional information about the string object's layout: how is the length stored? Before the string data? In some field of the object structure? At which offset? And in which byte order? Why do you think zero-terminated strings are still used? Because the "programming world" is still "more C-like", and the simplicity of that string format also came from C's "portability". The simpler the format, the more portable it is... never mind that it's silly to call code every time you want to get the string length.
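For illustration, a minimal C sketch of such a counted ("binary") string - the names are made up, this is not any particular library's layout:

    #include <stdlib.h>
    #include <string.h>

    /* length is stored next to the data, so "get length" is a field read, not a scan */
    typedef struct {
        size_t len;     /* computed once, at conversion time */
        char  *data;    /* may even contain embedded zeros   */
    } BString;

    /* the only place a zero-terminated strlen is ever needed */
    BString bstring_from_cstr(const char *s)
    {
        BString b;
        b.len  = strlen(s);            /* one scan, once         */
        b.data = malloc(b.len + 1);    /* error handling omitted */
        memcpy(b.data, s, b.len + 1);
        return b;
    }

    /* every later "how long is it?" costs a single read */
    size_t bstring_len(const BString *b) { return b->len; }

After the conversion, every operation works on b.len directly and the scan never happens again.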

Optimising strlen is one of the favourite pastimes of hobbyist (and not only hobbyist) programmers, but actually it has little to do with real gains - it is mostly just pride.
One may hide behind the words "speed matters", do maniacal things and believe that this produces faster code. Well, to some extent optimisation is the right thing, but only up to what the hardware can provide. For strlen the CPU's ALU will always be faster than the other subsystems of the machine, so whether it is kilobytes in size and "maniacally" optimised (that's an illusion), or just a hundred bytes in size but optimised as much as required, the strlen algos will perform the same.

Quote from: nidud on March 25, 2015, 07:23:53 AM
The testbed I use passes a new, randomly aligned pointer for each call. But in cases where speed matters these functions work in a loop and thus use the same memory repeatedly. A call here and there to strlen is not that important speed-wise.

Did you read the big post? Did you read what was written here, on these pages? You pass a "new pointer" which points into a very limited set of very short strings which, even if they were 100 times longer, would still fit into the cache of any more or less modern CPU. So this method of testing doesn't even get close to real-world testing conditions. The proposal on testing written in the big main posts of the thread, Hutch's advice, and Jochen's implementation of the test are the right way of testing algos which fall under point 3), which you may read about in the main post of the thread.

In the cases where speed is important, no code will call strlen twice - it will call it once, save the string length, and then manipulate the string as a binary (counted) string, because a "zero-terminated string" is not a speed-wise thing and doesn't fit the "speed code" ideology.

Calling strlen from time to time with different strings is what real code actually does. No speed-oriented code will call strlen more than once for any given string.

This is just a demagogical dispute ::)

Quote from: nidud on March 25, 2015, 07:23:53 AM
Quote
If you want to test a function which may be called millions of times in the real world - well, OK, just do a testbed like Hutch said: create an array of pointers to short or long strings (as you prefer), but make the number of strings very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real-world performance of the algo.

I do this all the time in a real application to select which algo to use, and I don't see the cache as a problem in this case.

If you don't want to understand what was said in the posts of this thread or, more probably, you don't want to agree with it, then I don't know what to say. If you don't see any problems and think you are testing the right way - do that, it's your right.

Your testbed is not bad - it's very comfortable to use (I have already tried it) and has stable timings. Maybe you feel that this dispute about the wrong way of testing is an attack on your work - but that's not so. These were points about certain types of algos, and that was all more or less clearly stated in the main thread post.

Quote from: nidud on March 25, 2015, 07:23:53 AM
Quote
our tests here and in the Memcopy thread are actually almost useless
...
that was just an example which may prove something to those who find it useful.

I do find the rant useful but somewhat depressing given my experience is not the same.

You may take it as a rant, but probably there are other people who see that it is useful and true - from their experience, too, especially if they don't try to deny some basic things in favour of their pride.

That is your opinion, and you have a right to it. In my opinion, for instance, Intel's 2014 strlen is just marketing and no more. Yes, big and "impressive" code - well, that's how every business has to work: impress the users and they will respect you. But in real conditions all that code works no faster than an algo which is 10 times smaller. Don't believe it? Well, that's your right. Moreover, the bigger algo which falls under point 3) will work even slower. Why? Because it is not always in the first-level code (trace) cache. Being executed, let's say, one microsecond after the previous call, the long and bloated algo takes longer to decode: the shorter algo will already have returned the string length while the bloated but "impressive looking" algo is still being decoded and fed to the CPU core. Well, this is probably a rant too, so if everyone has the right to stick with their opinion, the dispute is useless.

Antariy

Quote from: jj2007 on March 25, 2015, 06:50:53 AM
Quote from: Antariy on March 25, 2015, 06:13:10 AM
just do a testbed like Hutch said: create an array of pointers to short or long strings (as you prefer), but make the number of strings very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real-world performance of the algo.

Note that the 'bible' testbed in reply #8 does exactly that. Here is a run with 50 bibles, 200 MB, 1519150 different pointers:
Len(): 54501 us for 1519150 lines, 200850450 bytes
len(): 120439 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 53723 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54899 us for 1519150 lines, 200850450 bytes

Len(): 53089 us for 1519150 lines, 200850450 bytes
len(): 111344 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 54654 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54684 us for 1519150 lines, 200850450 bytes


Thank you, Jochen, that's the right implementation of the test for this kind of algo (from my subjective point of view) :t
Actually, if you tested the "SSE1" code from this thread in that test, then I would recommend not using it for everyday work, because it is unfinished and, moreover, it was just "dirty hacked" into the testbed. It works properly in the testbed - and you can be sure of that, because the first thing the test does is check whether the algos return the right results - but it is not designed to be used in applications, because I was too lazy: it's a big, hasty redesign of an almost different algo and is sloppy, but it does its work properly and serves the testing and research purposes described here. Anyone with doubts about the idea proved in this research may simply replace it with Intel's Silvermont algo.

nidud

#21
deleted

rrr314159

#22
AMD, 2 MB L2 cache

Intel, 1 MB L2, 6 MB L3

(I think)

This testing is a PITA. I expect at least a 1500-word post from you after all this trouble :)
(From the other thread, Antariy's advice:) On AMD use ax with bsf - Thanks! I should have thought of that :redface:

[edit] that smiley :P isn't sticking out his tongue - is he?
I am NaN ;)

jj2007

Quote from: nidud on March 25, 2015, 11:24:30 AM
When you tokenize a text file you tend to reuse the same memory. The same with reading files, where you use the same file block, which will end up in the cache before you call strlen to get the length of the file.

When you tokenise a 200 MB file, only a tiny fraction of it is in the data cache. Afterwards the pointers may fit into the cache, but the algo needs to examine the 200 MB of content, not the pointers.

> before you call strlen to get the length of the file.

That sounds as if you want to use strlen to get the 200 MB figure. That is hardly real-world usage; you open the file, you know its size. What you don't know yet are the lengths of the many tiny strings (actually, MasmBasic stores the lengths during the tokenising process, but that's another topic).
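For illustration only - this is not the MasmBasic code, just a minimal C sketch of the idea: a tokeniser can record each line's length in the same pass that finds the line breaks, so no separate strlen pass is ever needed:

    #include <stddef.h>

    typedef struct {
        char  *ptr;   /* start of the line        */
        size_t len;   /* length, recorded in-pass */
    } Line;

    /* split buf (size bytes) into lines; returns the number of lines stored,
       writing at most max entries - the lengths come for free while scanning.
       A trailing line without '\n' is ignored in this sketch.                */
    size_t tokenize(char *buf, size_t size, Line *out, size_t max)
    {
        size_t n = 0, start = 0;
        for (size_t i = 0; i < size && n < max; i++) {
            if (buf[i] == '\n') {
                out[n].ptr = buf + start;
                out[n].len = i - start;   /* no strlen needed later         */
                buf[i] = 0;               /* optionally also zero-terminate */
                n++;
                start = i + 1;
            }
        }
        return n;
    }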

nidud

#24
deleted

jj2007

Hi nidud,

It seems we have a serious communication problem - probably my fault.

What I am trying to say is:
- cache influence doesn't matter if you read in or stream in 200 MB; in contrast to testing one (cached) string a million times, the algo needs to access memory if the file size is well beyond the 2 or 3 MB of the L3 cache.

- the average string length of asm files is around 20 bytes, for C headers it is around 35 bytes, and for the bible it's 132 bytes; so for testing real-world apps, around 50 bytes might be a good compromise.

Another test for non-cached files: sorting a text file. Again, bible.txt is being used, and two algos are being compared:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Masm32 assort:
30382 lines sorted in 0.014 seconds
30382 lines sorted in 0.011 seconds
30382 lines sorted in 0.012 seconds
30382 lines sorted in 0.012 seconds
30382 lines sorted in 0.011 seconds
MasmBasic QSort:
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds


Bible.txt is 4 MB, so it can only partially be cached. The attached archive contains BibleSort50.exe, which creates and tests a 200 MB file from bible.txt. That one definitely doesn't fit into the cache, so be prepared for big differences between CPUs, depending on memory characteristics like bus speed and cache size.

hutch--

The bible is not a bad test piece at about 4.5 meg. If you need a decent big sample, just knock yourself up a multiple-copy version - copy it onto itself until it's a few hundred meg - then you will know what linear algo speed is like. If you have a reasonable tokeniser, tokenise it first, THEN time its read/write/modify speed on a massive array.
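A quick C sketch of that "multiple copy" trick (the file names and copy count are placeholders; it writes the source file N times back to back rather than literally appending it to itself, which amounts to the same thing):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *src = "bible.txt";
        const char *dst = "bible50.txt";
        int copies = 50;                 /* 50 x ~4.5 MB -> ~225 MB */

        FILE *in = fopen(src, "rb");
        if (!in) return 1;
        fseek(in, 0, SEEK_END);
        long size = ftell(in);
        rewind(in);

        char *buf = malloc((size_t)size);
        if (!buf || fread(buf, 1, (size_t)size, in) != (size_t)size) return 1;
        fclose(in);

        FILE *out = fopen(dst, "wb");
        if (!out) return 1;
        for (int i = 0; i < copies; i++)
            fwrite(buf, 1, (size_t)size, out);   /* the same text, repeated */
        fclose(out);
        free(buf);
        return 0;
    }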

jj2007

Quote from: hutch-- on March 25, 2015, 10:13:59 PM
if you need a decent big sample, just knock yourself up a multiple-copy version, copy it onto itself until it's a few hundred meg

Good idea :t
Quote from: jj2007 on March 25, 2015, 10:07:30 PM
The attached archive contains BibleSort50.exe, which creates and tests a 200 MB file from bible.txt.
;)

dedndave

god's gonna strike you down for using the bible as "random data"   :badgrin:

jj2007

Quote from: dedndave on March 25, 2015, 10:44:12 PM
god's gonna strike you down for using the bible as "random data"   :badgrin:

I am just providing this important text in a more ordered version 8)