Author Topic: Research on the algos design  (Read 10272 times)

nidud

  • Member
  • *****
  • Posts: 1371
    • https://github.com/nidud/asmc
Re: Research on the algos design
« Reply #15 on: March 25, 2015, 01:47:10 AM »
In my experience, testing small algos and then doing a real test in the application they are used in shows consistency in the test results.

I will also add that the result from a real test can be surprising. It’s often difficult to see how many times these functions are called in any given sequence of code. A few small adjustments can then have a huge impact where these functions are called millions of times.

Antariy

  • Member
  • ****
  • Posts: 541
Re: Research on the algos design
« Reply #16 on: March 25, 2015, 06:13:10 AM »
Gunther, thank you! :t Yes, if it's not boring, test it at home, too.

Steve, thank you for the test! It seems that both of your tested CPUs have 2 MB of cache.

In my experience, testing small algos and then doing a real test in the application they are used in shows consistency in the test results.

I will also add that the result from a real test can be surprising. It’s often difficult to see how many times these functions are called in any given sequence of code. A few small adjustments can then have a huge impact where these functions are called millions of times.

You missed the main point. You talk about "million times called" functions, but where have you seen an application which does this with the same string consecutively?
If you want to test such a function, which may be called millions of times in the real world - well, OK, just do a testbed like Hutch said: create an array of pointers to short/long (as you want) strings, but the number of strings should be very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real world performance of the algo. All other "million times called" types of test do not show the right result. The algo may be very fast within the testbed, but in reality it will not perform any faster than a much smaller algo, or it may even be slower. Yes, that's surprising, just as in the test here you see that 3 algos of different design and size run at the same speed when tested under real world conditions. You may take that or not, but, yes, our tests here and in the Memcopy thread are actually almost useless: we posted a bunch of functions, but in the real world all of them will perform at the same speed.
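
In C terms, the shape of such a testbed is roughly this (an illustrative sketch only - the counts and names are made up, and of course the real testbeds in this thread are in asm):
Code:
/* Sketch of a "real world" strlen testbed: many strings, far more data
   than any cache level can hold. Counts and names are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NSTRINGS 2000000              /* lots of strings...              */
#define AVGLEN   64                   /* ...of modest length, ~128 MB in */
                                      /* total, far beyond the cache     */
int main(void)
{
    char **ptrs = malloc(NSTRINGS * sizeof(char *));
    size_t i, total = 0;

    /* build the pool: every string is a separate allocation with a random length */
    for (i = 0; i < NSTRINGS; i++) {
        size_t len = 1 + (size_t)(rand() % (2 * AVGLEN));
        ptrs[i] = malloc(len + 1);
        memset(ptrs[i], 'A', len);
        ptrs[i][len] = 0;
    }

    /* timed pass: the walked memory cannot all stay cached, so the timing
       is dominated by memory access, as in real usage */
    clock_t t0 = clock();
    for (i = 0; i < NSTRINGS; i++)
        total += strlen(ptrs[i]);     /* replace with the algo under test */
    clock_t t1 = clock();

    printf("%zu bytes scanned in %.0f ms\n",
           total, (t1 - t0) * 1000.0 / CLOCKS_PER_SEC);
    return 0;
}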

And, again, each point was described in the big posts - when we are talking about those conditions and about algos which fall under point 3), it's impossible to argue. Well, it's possible, but that dispute is useless, since if the well explained posts haven't proved anything, then there is nothing to dispute. Everyone has their own opinion and has the right to stick with it; this post was not an "eye-opening reveal", that was just an example which may prove something to those who find it useful.

jj2007

  • Member
  • *****
  • Posts: 7558
  • Assembler is fun ;-)
    • MasmBasic
Re: Research on the algos design
« Reply #17 on: March 25, 2015, 06:50:53 AM »
just do a testbed like Hutch said: create an array of pointers to short/long (as you want) strings, but the number of strings should be very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real world performance of the algo.

Note that the 'bible' testbed in reply #8 does exactly that. Here is a run with 50 bibles, 200 MB, 1519150 different pointers:
Code:
Len(): 54501 us for 1519150 lines, 200850450 bytes
len(): 120439 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 53723 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54899 us for 1519150 lines, 200850450 bytes

Len(): 53089 us for 1519150 lines, 200850450 bytes
len(): 111344 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 54654 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54684 us for 1519150 lines, 200850450 bytes

nidud

  • Member
  • *****
  • Posts: 1371
    • https://github.com/nidud/asmc
Re: Research on the algos design
« Reply #18 on: March 25, 2015, 07:23:53 AM »
My point was the connection between testing and real usage, and my experience is that the test result is not perfect but valid in this regard.

You missed the main point. You talk about "million times called" functions, but where have you seen an application which does this with the same string consecutively?

All the time where speed matters  :P

The testbed I use passes a new, randomly aligned pointer for each call. But in cases where speed matters these functions work in a loop and thus use the same memory repeatedly. A call here and there to strlen is not that important speed-wise.
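
In C terms the alignment part of that looks roughly like this (a sketch only, not the actual testbed code):
Code:
/* Sketch: hand each call a differently aligned copy of the test string
   (illustration only - the real testbed is in asm). */
#include <stdlib.h>
#include <string.h>

/* buf must be at least strlen(s) + 16 bytes and suitably aligned */
static const char *randomly_aligned(const char *s, char *buf)
{
    size_t off = (size_t)(rand() & 15);   /* random 0..15 byte misalignment */
    strcpy(buf + off, s);
    return buf + off;
}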

Quote
If you want to test such a function, which may be called millions of times in the real world - well, OK, just do a testbed like Hutch said: create an array of pointers to short/long (as you want) strings, but the number of strings should be very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real world performance of the algo.

I do this all the time in a real application to select which algo to use, and I don’t see the cache as a problem in this case.

Quote
our tests here and in the Memcopy thread are actually almost useless
...
that was just an example which may prove something to those who find it useful.

I do find the rant useful but somewhat depressing given my experience is not the same.

Antariy

  • Member
  • ****
  • Posts: 541
Re: Research on the algos design
« Reply #19 on: March 25, 2015, 08:24:01 AM »
My point was the connection between testing and real usage, and my experience is that the test result is not perfect but valid in this regard.

It's "valid", but not always. Limitedly valid, let's say that. Yes, you may say "If the algo which is tested in an usual way is faster than other algos, then, almost always (key word is "almost") it will be faster than other's algos in real world".
You are referring to experience - but your experience, even if it were unlimited, will not disprove some simple, basic things, like the memory bus speed, like the caching subsystem (which works well only with data that is constantly accessed and thus sits in the cache at least "almost" constantly), and like the memory subsystem of the OS, which slows things down even more. OK, well, referring to your experience: just a couple of weeks ago you were fully sure that there is no possibility to write a safe algo and that the "deal" of speed versus robustness is acceptable. "Experience" is just a word, and it will not disprove basic things like caching/memory bus speed limit/OS paging. We all may refer to our exclusive and limited experience (everyone's experience is limited), but that isn't an argument in a dispute, it is just a "side note" like "from your experience something was so".

You missed the main point. You talk about "million times called" functions, but where have you seen an application which does this with the same string consecutively?

All the time where speed matters  :P

Sarcasm and the "tongue" smiley are not an argument. You are unable to disprove what was said here, and you will never be able to disprove it - take that or not, everyone has the right to be a fanatic.

Try to find an example of an algo which calls strlen millions of times for the same string - consecutively. Why is that word underlined? Because if you call strlen only from time to time, the string will have been flushed from the cache within milliseconds of the previous call, so the next call to the code will again go the slow way - at memory bus speed.

And, for your info, no modern and "widely used" library which uses strings massively actually uses zero terminated strings. Every more or less advanced library uses binary strings which store their size in the string object, so the length of the string is computed only once - when it's converted from a zero terminated string to the string object used by the runtime. All other operations then require no further calls to a zero terminated strlen. Zero terminated strings are more of an anachronism - they are still used only for their simplicity and standardness. Every other string storage mechanism requires additional info about the string object container and its data layout, for instance: how is the length of the string stored? Before the string? Or in some field of the object structure? At which offset? And what is the byte order of the variable which holds the length? Why do you think zero terminated strings are still used? Because the "programming world" is still "more C-like", and that simplicity of the string format came from the "portability" of C, too. The simpler the format, the more portable it is... with no regard to the fact that it's silly to call code every time you want to get the string length.
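
Just to illustrate what I mean by a "binary string" object - a hypothetical C layout, not the layout of any particular runtime:
Code:
/* Hypothetical length-prefixed string container (illustration only). */
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t len;      /* the length is stored once, next to the data */
    char   data[];   /* the bytes follow; a terminating zero is kept for C interop */
} BStr;

/* the only place a zero terminated strlen is ever needed: at conversion time */
BStr *bstr_from_cstr(const char *s)
{
    size_t len = strlen(s);
    BStr  *b   = malloc(sizeof(BStr) + len + 1);
    if (!b) return NULL;
    b->len = len;
    memcpy(b->data, s, len + 1);
    return b;
}

/* after that, "get the length" is a field read, not a byte scan */
static size_t bstr_len(const BStr *b) { return b->len; }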

Strlen optimisation is one of the favourite pastimes of hobbyist (and not only hobbyist) programmers, but actually it has little to do with real gain - it is mostly just pride.
But one may hide behind the words "speed matters", do maniacal things and believe that this produces faster code. Well, to some extent optimisation is the right thing, but no more than the hardware can provide. And for strlen the CPU ALU will always be faster than the other subsystems of the machine, so whether it is kilobytes in size and "maniacally" optimised (but that's an illusion), or just a hundred bytes in size but optimised as much as is required, the strlen algos will perform the same.

The testbed I use passes a new, randomly aligned pointer for each call. But in cases where speed matters these functions work in a loop and thus use the same memory repeatedly. A call here and there to strlen is not that important speed-wise.

Did you read the big post? Did you read what was written here, on these pages? You pass a "new pointer" which points into a very limited set of very short strings which, even if they were 100 times longer, would all still fit into the cache of any more or less modern CPU. So this method of testing doesn't even get close to real world testing conditions. The testing proposal written in the big main posts of the thread, Hutch's advice and Jochen's implementation of the test are the right way of testing the algos which fall under point 3), which you may read in the main post of the thread.

In cases where speed is important, no code will call strlen twice - it will call it once, save the string length, and then manipulate the string as a binary string, because the term "zero terminated string" is not a speed-wise thing and doesn't fit the "speed code" ideology.

And calling strlen from time to time with different strings is exactly what real code will do. No speed-oriented code will call strlen more than once for any given string.
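
A trivial C illustration of that (hypothetical helper functions, nothing from the testbeds here):
Code:
#include <ctype.h>
#include <string.h>

/* careless: strlen re-scans the same string on every loop iteration */
void upper_slow(char *dst, const char *src)
{
    for (size_t i = 0; i < strlen(src); i++)
        dst[i] = (char)toupper((unsigned char)src[i]);
    dst[strlen(src)] = 0;
}

/* speed-oriented: strlen is called once, then the length is reused */
void upper_fast(char *dst, const char *src)
{
    size_t len = strlen(src);
    for (size_t i = 0; i < len; i++)
        dst[i] = (char)toupper((unsigned char)src[i]);
    dst[len] = 0;
}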

This is just a demagogical dispute ::)

Quote
If you want to test such a function, which may be called millions of times in the real world - well, OK, just do a testbed like Hutch said: create an array of pointers to short/long (as you want) strings, but the number of strings should be very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real world performance of the algo.

I do this all the time in a real application to select which algo to use, and I don’t see the cache as a problem in this case.

If you don't want to understand what was said in the posts of this thread or, more probably, you don't want to agree with it, then I don't know what to say here. If you don't see any problems and think that you are testing the right way - do that, it's your right.

Your testbed is not bad - it's very comfortable to use (I have already tried it) and has stable timings. Maybe you feel that the dispute about the wrong way of testing is harming your work - but that's not so. Those were points about certain types of algos, and that was all more or less clearly stated in the main thread post.

Quote
our tests here and in the Memcopy thread are actually almost useless
...
that was just an example which may prove something to those who find it useful.

I do find the rant useful but somewhat depressing given my experience is not the same.

You may take that as a rant, but probably there are other people who see that it is useful and true - from their experience, too, especially if they don't try to deny some basic things in favour of their pride.

That is your opinion, and you have a right to it. In my opinion, for instance, Intel's 2014 strlen is just marketing stuff and no more. Yes, big and "impressive" code - well, that's how every business has to work: impress the users, and they will respect you. But in the end that code, under real conditions, will work no faster than an algo which is 10 times smaller. Don't believe it? Well, that's your right. Moreover, the bigger algo which falls under point 3) will work even slower. Why? Because it is not always in the first-level code/trace cache. And being executed, let's say, one microsecond after the previous call to it was made, the long and bloated algo will take longer to decode: the shorter algo will already have returned the string length while the bloated but "impressive looking" algo is still being decoded to be fed to the CPU core. Well, this is probably a rant too, so if everyone has the right to stick with their opinion, the dispute is useless.

Antariy

  • Member
  • ****
  • Posts: 541
Re: Research on the algos design
« Reply #20 on: March 25, 2015, 08:32:19 AM »
just do a testbed like Hutch said: create an array of pointers to short/long (as you want) strings, but the number of strings should be very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real world performance of the algo.

Note that the 'bible' testbed in reply #8 does exactly that. Here is a run with 50 bibles, 200 MB, 1519150 different pointers:
Code:
Len(): 54501 us for 1519150 lines, 200850450 bytes
len(): 120439 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 53723 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54899 us for 1519150 lines, 200850450 bytes

Len(): 53089 us for 1519150 lines, 200850450 bytes
len(): 111344 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 54654 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54684 us for 1519150 lines, 200850450 bytes

Thank you, Jochen, that's the right implementation of the test for this kind of algo (from my subjective point of view) :t
Actually, if you did test the "SSE1" code from this thread in that test, then I would recommend not using it for everyday work, because it is unfinished and, moreover, it was just "dirty hacked" into the testbed. It works properly in the testbed - and you may be sure of that, because the first thing the test does is check whether the algos return the right results - but it isn't designed to be used as the algo in applications, because I was too lazy: it's a big, hasty redesign of an almost different algo and is sloppy, but it does its work properly and serves the testing and research purposes described here. Anyone may simply replace it with Intel's Silvermont algo if they have any doubts about the idea proved in this research.

nidud

  • Member
  • *****
  • Posts: 1371
    • https://github.com/nidud/asmc
Re: Research on the algos design
« Reply #21 on: March 25, 2015, 11:24:30 AM »
You are referring to experience - but your experience, even if it were unlimited, will not disprove some simple, basic things, like the memory bus speed, like the caching subsystem..
Quote
What idea does this research prove?

We all know and like to write algos and test them in the lab. The problem is in the method used for that. Currently we "test" the algos against short (relative to the 2nd level data cache, read below for an explanation) pieces of data, and the test runs many times over the same short data. Executed under these conditions, some types of algos do not show their real world performance.

After the first (or at least a couple more) iterations of the test - a call to the tested code and a return from it - the short data is already placed in the cache, so the subsequent iterations of the test run under conditions which are unfair (read below why) with respect to the real world performance of the code. So only the first iteration runs at real world performance; the next many thousands of iterations run not under real world but "static" conditions, so in the final averaged results the timing of the first run has no influence and cannot show the real performance.

In the testbed the functions are read into a virtual protected area of the code segment, and they are all called in a validation test before the timing begins. This means they all start with all the data already in the cache.
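
(For completeness: if one wanted timings without that warm-up effect, a small C harness could explicitly evict the string between calls - a sketch assuming x86 with SSE2 intrinsics, not something the testbed actually does:)
Code:
/* Sketch: flush a string's cache lines before the timed call, so the call
   pays memory latency instead of hitting warm L1 data. Assumes x86/SSE2;
   this is NOT what the testbed does, just an illustration of the issue. */
#include <emmintrin.h>
#include <string.h>

static void flush_string(const char *s, size_t len)
{
    for (size_t off = 0; off <= len; off += 64)   /* 64 = typical cache line */
        _mm_clflush(s + off);
    _mm_mfence();                                  /* make the eviction visible */
}

size_t strlen_cold(const char *s, size_t known_len)
{
    flush_string(s, known_len);   /* next access goes out to memory */
    return strlen(s);             /* or the algo under test */
}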

Quote
OK, well, referring to your experience: just a couple of weeks ago you were fully sure that there is no possibility to write a safe algo and that the "deal" of speed versus robustness is acceptable.

I don't remember saying it was impossible or acceptable, but I assumed there was no danger in the environment where the functions were used.

Quote
"Experience" is just a word, and it will not disprove basical things like caching/memory bus speed limit/OS paging. We all here may refer to our exclusive and limited (every people has limited experience) experience, but that isn't an argument in a disput, that is just a "side note" like "from your experience something was so".

In this case the "experience" is based on testing with some good results.

Quote
You missed the main point. You talk about "million times called" functions, but where have you seen an application which does this with the same string consecutively?

All the time where speed matters  :P

Sarcasm and the "tongue" smiley are not an argument. You are unable to disprove what was said here, and you will never be able to disprove it - take that or not, everyone has the right to be a fanatic.

 :biggrin:

Sorry, didn't see the tongue there, but one tends to use the same memory in these loops.

Quote
Try to find an example of an algo which calls strlen millions of times for the same string - consecutively. Why is that word underlined? Because if you call strlen only from time to time, the string will have been flushed from the cache within milliseconds of the previous call, so the next call to the code will again go the slow way - at memory bus speed.

And there I disagree. When you tokenize a text file you tend to reuse the same memory. The same goes for reading files, where you use the same file block, which will end up in the cache before you call strlen to get the length of the file. There could be millions of files down the subdirectories, and you may open all of them, do a byte scan and so on.

Quote
You may take that as a rant,

Maybe a bad choice of words, but there was a lot of text there  :biggrin:

Quote
Well, this is probably a rant too, so if everyone has the right to stick with their opinion, the dispute is useless.

No, that was interesting, and I disagree - the dispute is not useless.

rrr314159

  • Member
  • *****
  • Posts: 1381
Re: Research on the algos design
« Reply #22 on: March 25, 2015, 11:50:35 AM »
AMD, 2 MB L2 cache

Intel, 1 MB L2, 6 MB L3

(I think)

This testing is a PITA; I expect at least a 1500-word post from you after all this trouble :)
(From the other thread, Antariy's advice:) On AMD use ax with bsf - Thanks! I should have thought of that :redface:

[edit] that smiley :P isn't sticking out his tongue - is he?
« Last Edit: March 26, 2015, 05:00:48 PM by rrr314159 »
I am NaN ;)

jj2007

  • Member
  • *****
  • Posts: 7558
  • Assembler is fun ;-)
    • MasmBasic
Re: Research on the algos design
« Reply #23 on: March 25, 2015, 06:50:38 PM »
When you tokenize a text file you tend to reuse the same memory. The same goes for reading files, where you use the same file block, which will end up in the cache before you call strlen to get the length of the file.

When you tokenise a 200MB file, then only a tiny fraction is in the data cache. Afterwards, the pointers may fit into the cache, but the algo needs to examine the 200MB content, not the pointers.

> before you call strlen to get the length of the file.

That sounds as if you want to use strlen for getting the 200MB info. That is hardly a real world usage; you open the file, you know its size. What you don't know yet is the length of many tiny strings (actually, MasmBasic stores the lengths during the tokenising process, but that's another topic).

nidud

  • Member
  • *****
  • Posts: 1371
    • https://github.com/nidud/asmc
Re: Research on the algos design
« Reply #24 on: March 25, 2015, 09:42:15 PM »
When you tokenise a 200MB file, then only a tiny fraction is in the data cache. Afterwards, the pointers may fit into the cache, but the algo needs to examine the 200MB content, not the pointers.

You think JWASM or similar programs use a 200M read buffer? The read process is normally streamed with a small buffer size.
Code:
_MAXIOBUF equ 4000h
_INTIOBUF equ 1000h
_MINIOBUF equ 0200h

Each line is then normally fetched into a line buffer using fgets(). Both the read buffer and the line buffer are now in the cache.
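
In C terms the pattern is roughly this (a sketch; only the buffer sizes mirror the equates above, nothing is taken from JWASM itself):
Code:
/* Streamed read with a small, reused stdio buffer plus a line buffer
   (sketch; buffer sizes mirror the equates above). */
#include <stdio.h>
#include <string.h>

#define MAXIOBUF 0x4000      /* 16 KiB read buffer */
#define MAXLINE  512         /* line buffer        */

long sum_line_lengths(const char *path)
{
    static char iobuf[MAXIOBUF];
    char line[MAXLINE];
    long total = 0;

    FILE *f = fopen(path, "r");
    if (!f) return -1;
    setvbuf(f, iobuf, _IOFBF, sizeof(iobuf));   /* small, reused read buffer */

    /* both iobuf and line are reused for every line, so they stay cached */
    while (fgets(line, sizeof(line), f))
        total += (long)strlen(line);            /* strlen on warm memory */

    fclose(f);
    return total;
}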

Quote
That sounds as if you want to use strlen for getting the 200MB info.

The buffer in the file block for the file name is 260 bytes. Scanning around 100K files gives this result:
Code:
Average string lengths: 22, calls: 103780

So what is the average string length in general?
I will assume closer to 20 bytes than 200M.

jj2007

  • Member
  • *****
  • Posts: 7558
  • Assembler is fun ;-)
    • MasmBasic
Re: Research on the algos design
« Reply #25 on: March 25, 2015, 10:07:30 PM »
Hi nidud,

It seems we have a serious communication problem - probably my fault.

What I am trying to say is:
- cache influence doesn't matter whether you read in or stream in 200 MB: in contrast to testing one (cached) string a million times, the algo needs to access memory if the file size is well beyond the 2 or 3 MB of L3 cache.

- the average string len of asm files is around 20 bytes, sure, for C headers it is around 35 bytes, and for the bible it's 132 bytes; so for testing real world apps, around 50 bytes might be a good compromise.

Another test for non-cached files: sorting a text file. Again, bible.txt is being used, and two algos are being compared:
Code:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Masm32 assort:
30382 lines sorted in 0.014 seconds
30382 lines sorted in 0.011 seconds
30382 lines sorted in 0.012 seconds
30382 lines sorted in 0.012 seconds
30382 lines sorted in 0.011 seconds
MasmBasic QSort:
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds

Bible.txt is 4MB, so it can only partially be cached. The attached archive contains a BibleSort50.exe, which creates and tests a 200MB file from bible.txt. And that one definitely doesn't fit into the cache, so be prepared for big differences between CPUs and memory characteristics like bus speed and cache size.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4813
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Research on the algos design
« Reply #26 on: March 25, 2015, 10:13:59 PM »
The bible is not a bad test piece at about 4.5 meg. If you need a decent big sample, just knock yourself up a multiple-copy version - copy it onto itself until it's a few hundred meg - then you will know what linear algo speed is like. If you have a reasonable tokeniser, tokenise it first, THEN time its read/write/modify speed on a massive array.
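
Something along these lines will knock one up (a quick C sketch; the file names are just examples):
Code:
/* Quick sketch: append 50 copies of bible.txt to one big test file
   (~4.5 meg x 50 = a couple of hundred meg). File names are examples. */
#include <stdio.h>

int main(void)
{
    char buf[1 << 16];
    FILE *out = fopen("bigbible.txt", "wb");
    if (!out) return 1;

    for (int copy = 0; copy < 50; copy++) {
        FILE *in = fopen("bible.txt", "rb");
        if (!in) { fclose(out); return 1; }
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
            fwrite(buf, 1, n, out);
        fclose(in);
    }
    fclose(out);
    return 0;
}
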
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

jj2007

  • Member
  • *****
  • Posts: 7558
  • Assembler is fun ;-)
    • MasmBasic
Re: Research on the algos design
« Reply #27 on: March 25, 2015, 10:24:58 PM »
if you need a decent big sample, just knock yourself up a multiple-copy version - copy it onto itself until it's a few hundred meg

Good idea :t
The attached archive contains a BibleSort50.exe, which creates and tests a 200MB file from bible.txt.
;)

dedndave

  • Member
  • *****
  • Posts: 8734
  • Still using Abacus 2.0
    • DednDave
Re: Research on the algos design
« Reply #28 on: March 25, 2015, 10:44:12 PM »
god's gonna strike you down for using the bible as "random data"   :badgrin:

jj2007

  • Member
  • *****
  • Posts: 7558
  • Assembler is fun ;-)
    • MasmBasic
Re: Research on the algos design
« Reply #29 on: March 25, 2015, 10:54:33 PM »
god's gonna strike you down for using the bible as "random data"   :badgrin:

I am just providing this important text in a more ordered version 8)