This post sets out to prove some points about how we test algos.
The attached archive contains the two programs, their sources, and the BAT file which needs to be run to do the actual test - there is no need to run the EXE files themselves. The programs require SSE2, so at least a PIV machine is needed to run them.
Test results are very much appreciated. If you decide to run the test, please do it according to the following scheme:
Run StrLenTest.bat from the command line with its output redirected to a text file, e.g.: StrLenTest.bat > results.txt. Then copy the content of the file to the forum. If the content is too long to fit into one message, split it into two posts. Also, if you know it, please mention the size of the 2nd level data cache on your machine (the cache per CPU core, not the full size of all cores' caches; if the CPU has a 3rd level cache, mention its size in addition).
The full test requires that your machine is able to allocate 64 MB, so it's a good idea to run it on a machine with 256+ MB of RAM. The partial test, though, requires only a small amount of memory (up to 100 KB), which any machine today can provide.
The testbed is an old one, used in 2010 to test StrLen algos, adapted to the current task.
This post will be split into two because of the forum's message size limitation (is it possible to increase it, Hutch?).
What idea does this research prove?
We all know and like to write algos and test them in the lab. The problem is the method used for that. Currently we "test" algos against short pieces of data (short relative to the 2nd level data cache - read below for an explanation), and the test runs many times over the same short data. Executed under these conditions, some types of algos do not show their real-world performance.
After the first iteration of the test (or at least after a couple more) - a call to the tested code and the return from it - the short data is already in the cache, so all following iterations run under conditions which are unfair (read below why) with respect to the real-world performance of the code. Only the first iteration runs at real-world speed; the following thousands of iterations run under "static", not real-world, conditions, so in the final averaged results the timing of that first run has no influence and cannot show the real performance.
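To make that concrete, here is a minimal sketch of such a lab-style loop - in C for illustration only, since the actual testbed is written in assembly, and the buffer size and iteration count here are made up. The first call pulls the buffer into the cache, so the reported average is almost entirely the cache-warm case:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>   /* __rdtsc (GCC/Clang) */

#define ITERATIONS 100000   /* made-up count, just for illustration */

int main(void)
{
    static char buf[1024];              /* "short" data, far below L2 size */
    memset(buf, 'A', sizeof(buf) - 1);  /* buf[1023] stays as terminator */

    size_t total = 0;                   /* consume results so the calls
                                           are not optimised away */
    uint64_t start = __rdtsc();
    for (int i = 0; i < ITERATIONS; i++)
        total += strlen(buf);           /* iteration 1 loads buf into the
                                           cache; iterations 2..N then run
                                           entirely cache-warm */
    uint64_t cycles = __rdtsc() - start;

    printf("%llu cycles/call on average (cache-warm), total=%zu\n",
           (unsigned long long)(cycles / ITERATIONS), total);
    return 0;
}
```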
Well, OK, suppose one has code which is the "fastest" under these conditions - for some types of algos that means little for real performance. One such type is the "StrLen" kind of algo.
Why is it so?
Because in the real world the application runs in a multitasking environment where the machine's resources are heavily shared between applications, so even the 2nd level data cache may not be considered an exclusive-use resource for any one program.
But the point of bigger value is that in a real-world system the code almost always executes on data that is not in the cache. So it may be taken as an "axiom" that code which walks over all the passed data in a single pass should not be tested in a manner where the cache's "help" makes the code look too fast. StrLen algos are exactly such code - they walk through all the passed data exactly once, and it's hard to find an app which calls a StrLen algo in a loop to compute the size of the same string millions of times. Called in the real world, this code will run on data that is not in the cache, so there is not much point in "crazy optimisations": after all, every one of them hits the bottleneck of the memory (RAM) subsystem speed and, moreover, the bottleneck of the paging subsystem of memory.
If the data is not in the cache, it will be loaded from RAM, and that will always be much slower than any over-sophisticated code can process it.
If the data is not even in RAM but in the paging file (a frequent case on a multitasking system with finite resources), then the OS memory subsystem must handle the page fault and load the page from the paging file into RAM before the code can touch it. That costs many, many thousands of cycles, so even incredibly-crazy-sophisticated code will have to wait for it, with no respect paid to its sophistication.
This is the real world for algos of the type described in the axiom above.
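To measure the cold case deliberately, one can evict the buffer from the cache before each timed call. Below is a hedged C sketch of that idea - it is not taken from the attached testbed, the helper name is mine, and 64 bytes is assumed as the cache line size; CLFLUSH itself is an SSE2 instruction, so it fits the same PIV+ requirement:

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>   /* __rdtsc (GCC/Clang) */

/* Evict every cache line of the buffer so the next read comes from RAM.
   Illustrative helper, not part of the attached testbed. */
static void flush_buffer(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += 64)   /* 64 = assumed line size */
        _mm_clflush(buf + i);
    _mm_mfence();                          /* wait for the flushes */
}

int main(void)
{
    static char buf[64 * 1024];
    memset(buf, 'A', sizeof(buf) - 1);     /* terminator at the end */

    uint64_t t0 = __rdtsc();
    size_t len = strlen(buf);              /* warm: buf was just written */
    uint64_t warm = __rdtsc() - t0;

    flush_buffer(buf, sizeof(buf));        /* force the real-world case */
    t0 = __rdtsc();
    len = strlen(buf);                     /* cold: every line from RAM */
    uint64_t cold = __rdtsc() - t0;

    printf("len=%zu warm=%llu cold=%llu cycles\n",
           len, (unsigned long long)warm, (unsigned long long)cold);
    return 0;
}
```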
The proof.
Well, here is the pretty sophisticated code which proves the point above.
The test covers the following algos:
AxStrLenSSE - the simple code using one XMM reg (and, BTW, preserving it, too); a sketch of the general technique it uses is shown after this list.
strlen - nidud's code, which fetches 32 bytes per iteration using 2 XMM regs.
AxStrLenSSE1 - the complex code which may probably be considered a "state of the art" of its kind: it uses 4 XMM regs, i.e. it works on 64-byte chunks. The task is not quite trivial, because the simplest way of doing this without over-complicating the code is a 32-byte chunk. With a 64-byte chunk the code gets complications: there are two possible 32-byte-chunk misalignments, and they are handled in the first part of the code. Each case uses a "simple" 32-byte chunk to process the misaligned head, but the second case, in addition, has to be corrected regardless of whether it was jumped to directly or entered after the first misalignment correction.
After the correction, the code goes to the 64-byte data fetch, which iterates over the rest of the string data.
The complex entry code, which handles non-64-byte alignment via actual 32-byte chunks, slows the algo down. This is especially noticeable with string sizes bigger than 64 bytes (i.e. an actual 64-byte string plus the zero terminator) yet smaller than the amount of data needed to show the speed-up of the 64-byte fetch - in that range the complex entry has a noticeable influence. With strings below 64 bytes (63 chars + terminator) the speed is not hurt much, because there are shortcuts with fast returns when the string is smaller than 32 and/or 64 bytes. Actually, it would be possible to add a shortcut for the bottleneck in the 64...256 byte range, which would make this algo a "universal winner", but that was not the point - it would, again, make the code even more complex and big, for no real-world performance gain.
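For readers who don't want to dig through the attached sources, here is the general one-XMM technique in C with SSE2 intrinsics. This is only a hedged sketch of the idea behind the first algo, not the author's MASM code: the first load is rounded down to a 16-byte boundary (so a load never strays into an unmapped page), the head bytes are masked off, and the loop then checks 16 aligned bytes per iteration (__builtin_ctz is a GCC/Clang builtin; MSVC would use _BitScanForward instead):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* One-XMM strlen: compare 16 bytes at a time against zero. */
static size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    uintptr_t misalign = (uintptr_t)s & 15;
    const char *p = s - misalign;                /* round down: 16-aligned */

    /* First, possibly misaligned chunk: mask off bytes before the string. */
    __m128i chunk = _mm_load_si128((const __m128i *)p);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
    mask >>= misalign;                           /* drop the head bytes */
    if (mask)
        return (size_t)__builtin_ctz(mask);     /* zero byte in chunk 1 */

    /* Main loop: one aligned 16-byte fetch per iteration. */
    for (;;) {
        p += 16;
        chunk = _mm_load_si128((const __m128i *)p);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
    }
}
```

The 32- and 64-byte variants extend exactly this pattern with 2 or 4 such compares per iteration, which is where the head-alignment complications described above come from.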
Just look at the timings to understand that. Remember, the test data below is from a machine with 256 KB of 2nd level data cache.
The actual output was one continuous list, but I'll split it up, with a comment below each block, to point out what it shows.
Test with string sizes below the 2nd level data cache size (referring to the 256 KB machine)
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 1 bytes buffer near page end -------
25 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
31 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
25 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 2 bytes buffer near page end -------
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 4 bytes buffer near page end -------
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
31 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 8 bytes buffer near page end -------
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
31 cycles for AxStrLenSSE1
25 cycles for strlen
Well, with string sizes smaller than 16 the "fastest" algo is the first, one-XMM one. If the string is 16+ chars, then with the terminator it is 17+ bytes, so it no longer fits into the one XMM reg and requires an additional iteration of execution; from there on the algo becomes "slower".
The second, complex algo has a higher overhead from its complex entry than the third (nidud's) algo, which is why it's a little bit slower on short strings.
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 16 bytes buffer near page end -------
33 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
31 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 24 bytes buffer near page end -------
33 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
26 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 32 bytes buffer near page end -------
39 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
39 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
39 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
40 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
40 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 40 bytes buffer near page end -------
39 cycles for AxStrLenSSE
44 cycles for AxStrLenSSE1
39 cycles for strlen
39 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
40 cycles for strlen
39 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
39 cycles for AxStrLenSSE
42 cycles for AxStrLenSSE1
39 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 48 bytes buffer near page end -------
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 56 bytes buffer near page end -------
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
40 cycles for strlen
The first algo shows a noticeable granularity of 16 bytes: each 16-byte step of the string length "triggers" it to the next, higher timing level.
The second algo continues to behave more or less the same until its fetch length is exceeded by at least 64 data bytes + the zero terminator.
The third algo shows a noticeable granularity of 32 bytes: each 32 bytes of data steps it to higher timings.
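That granularity follows directly from the fetch width: ignoring the entry/head handling, a W-byte fetch needs one more vector iteration for each extra W bytes of string, i.e. roughly ceil((chars + 1) / W) iterations. A tiny hedged illustration (the function name is mine, and this deliberately ignores the misalignment corrections described above):

```c
#include <stdio.h>

/* Vector iterations for an n-char string (plus zero terminator) at
   fetch width w: the timing steps up once per w bytes of length.  */
static unsigned iters(unsigned n, unsigned w)
{
    return (n + 1 + w - 1) / w;   /* ceil((n + 1) / w) */
}

int main(void)
{
    /* The steps land where the timing blocks above step up: the
       16-byte algo at lengths 16, 32, 48...; the 32-byte one at
       32, 64...; the 64-byte one at 64, 128...                   */
    for (unsigned n = 8; n <= 64; n += 8)
        printf("len %2u: w=16 -> %u iters, w=32 -> %u, w=64 -> %u\n",
               n, iters(n, 16), iters(n, 32), iters(n, 64));
    return 0;
}
```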