This post sets out to prove some points about how we test algos.
The attached archive contains the two programs, their sources, and the BAT file which needs to be run to do the actual test - there is no need to run the EXE files themselves. The programs require SSE2, so at least a PIV machine is needed to run them.
Test results are very much appreciated. If you decide to run the test, please do it according to the following scheme:
Run StrLenTest.bat from the command line with its output redirected to a text file, e.g.: StrLenTest.bat > results.txt. Then copy the content of the file to the forum. If the content is too long to fit into one message, split it into two posts. Also, if you know it, please mention the size of the 2nd level data cache on your machine (the cache per CPU core, not the full size of all cores' caches; if the CPU has a 3rd level cache, mention its size in addition).
The full test requires that your machine is able to allocate 64 MB, so it's a good idea to run it on a machine with 256+ MB of RAM. The partial test, though, requires only a small amount of memory (up to 100 KB), which any machine today can provide.
The testbed is an old one, used in 2010 to test StrLen algos, adapted to the current task.
This post will be split into two because of the forum's message size limitation (is it possible to increase it, Hutch?).
What idea does this research prove?
We all know and like to write algos and test them in the lab. The problem is the method used for that. Currently we "test" algos against short pieces of data (short relative to the 2nd level data cache - read below for an explanation), and the test runs many times over the same short data. Executed under these conditions, some types of algos do not show their real-world performance.
After the first iteration of the test (or at least after a couple more) - a call to the tested code and the return from it - the short data is already in the cache, so all following iterations run under conditions which are unfair (read below why) with respect to the real-world performance of the code. Only the first iteration runs at real-world speed; the following thousands of iterations run under "static", not real-world, conditions, so in the final averaged results the timing of that first run has no influence and cannot show the real performance.
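To make that concrete, here is a minimal sketch of such a lab-style loop - in C for illustration only, since the actual testbed is written in assembly, and the buffer size and iteration count here are made up. The first call pulls the buffer into the cache, so the reported average is almost entirely the cache-warm case:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>   /* __rdtsc (GCC/Clang) */

#define ITERATIONS 100000   /* made-up count, just for illustration */

int main(void)
{
    static char buf[1024];              /* "short" data, far below L2 size */
    memset(buf, 'A', sizeof(buf) - 1);  /* buf[1023] stays as terminator */

    size_t total = 0;                   /* consume results so the calls
                                           are not optimised away */
    uint64_t start = __rdtsc();
    for (int i = 0; i < ITERATIONS; i++)
        total += strlen(buf);           /* iteration 1 loads buf into the
                                           cache; iterations 2..N then run
                                           entirely cache-warm */
    uint64_t cycles = __rdtsc() - start;

    printf("%llu cycles/call on average (cache-warm), total=%zu\n",
           (unsigned long long)(cycles / ITERATIONS), total);
    return 0;
}
```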
Well, OK, suppose one has code which is the "fastest" under these conditions - for some types of algos that means little for real performance. One such type is the "StrLen" kind of algo.
Why is it so?
Because in the real world the application runs in a multitasking environment where the machine's resources are heavily shared between applications, so even the 2nd level data cache may not be considered an exclusive-use resource for any one program.
But the point of bigger value is that in a real-world system the code almost always executes on data that is not in the cache. So it may be taken as an "axiom" that code which walks over all the passed data in a single pass should not be tested in a manner where the cache's "help" makes the code look too fast. StrLen algos are exactly such code - they walk through all the passed data exactly once, and it's hard to find an app which calls a StrLen algo in a loop to compute the size of the same string millions of times. Called in the real world, this code will run on data that is not in the cache, so there is not much point in "crazy optimisations": after all, every one of them hits the bottleneck of the memory (RAM) subsystem speed and, moreover, the bottleneck of the paging subsystem of memory.
If the data is not in the cache, it will be loaded from RAM, and that will always be much slower than any over-sophisticated code can process it.
If the data is not even in RAM but in the paging file (a frequent case on a multitasking system with finite resources), then the OS memory subsystem must handle the page fault and load the page from the paging file into RAM before the code can touch it. That costs many, many thousands of cycles, so even incredibly-crazy-sophisticated code will have to wait for it, with no respect paid to its sophistication.
This is the real world for algos of the type described in the axiom above.
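To measure the cold case deliberately, one can evict the buffer from the cache before each timed call. Below is a hedged C sketch of that idea - it is not taken from the attached testbed, the helper name is mine, and 64 bytes is assumed as the cache line size; CLFLUSH itself is an SSE2 instruction, so it fits the same PIV+ requirement:

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>   /* __rdtsc (GCC/Clang) */

/* Evict every cache line of the buffer so the next read comes from RAM.
   Illustrative helper, not part of the attached testbed. */
static void flush_buffer(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += 64)   /* 64 = assumed line size */
        _mm_clflush(buf + i);
    _mm_mfence();                          /* wait for the flushes */
}

int main(void)
{
    static char buf[64 * 1024];
    memset(buf, 'A', sizeof(buf) - 1);     /* terminator at the end */

    uint64_t t0 = __rdtsc();
    size_t len = strlen(buf);              /* warm: buf was just written */
    uint64_t warm = __rdtsc() - t0;

    flush_buffer(buf, sizeof(buf));        /* force the real-world case */
    t0 = __rdtsc();
    len = strlen(buf);                     /* cold: every line from RAM */
    uint64_t cold = __rdtsc() - t0;

    printf("len=%zu warm=%llu cold=%llu cycles\n",
           len, (unsigned long long)warm, (unsigned long long)cold);
    return 0;
}
```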
The proof.
Well, here is the pretty sophisticated code which proves the point above.
The test covers the following algos:
AxStrLenSSE - the simple code using one XMM reg (and, BTW, preserving it, too); a sketch of the general technique it uses is shown after this list.
strlen - nidud's code, which fetches 32 bytes per iteration using 2 XMM regs.
AxStrLenSSE1 - the complex code which may probably be considered a "state of the art" of its kind: it uses 4 XMM regs, i.e. it works on 64-byte chunks. The task is not quite trivial, because the simplest way of doing this without over-complicating the code is a 32-byte chunk. With a 64-byte chunk the code gets complications: there are two possible 32-byte-chunk misalignments, and they are handled in the first part of the code. Each case uses a "simple" 32-byte chunk to process the misaligned head, but the second case, in addition, has to be corrected regardless of whether it was jumped to directly or entered after the first misalignment correction.
After the correction, the code goes to the 64-byte data fetch, which iterates over the rest of the string data.
The complex entry code, which handles non-64-byte alignment via actual 32-byte chunks, slows the algo down. This is especially noticeable with string sizes bigger than 64 bytes (i.e. an actual 64-byte string plus the zero terminator) yet smaller than the amount of data needed to show the speed-up of the 64-byte fetch - in that range the complex entry has a noticeable influence. With strings below 64 bytes (63 chars + terminator) the speed is not hurt much, because there are shortcuts with fast returns when the string is smaller than 32 and/or 64 bytes. Actually, it would be possible to add a shortcut for the bottleneck in the 64...256 byte range, which would make this algo a "universal winner", but that was not the point - it would, again, make the code even more complex and big, for no real-world performance gain.
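For readers who don't want to dig through the attached sources, here is the general one-XMM technique in C with SSE2 intrinsics. This is only a hedged sketch of the idea behind the first algo, not the author's MASM code: the first load is rounded down to a 16-byte boundary (so a load never strays into an unmapped page), the head bytes are masked off, and the loop then checks 16 aligned bytes per iteration (__builtin_ctz is a GCC/Clang builtin; MSVC would use _BitScanForward instead):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* One-XMM strlen: compare 16 bytes at a time against zero. */
static size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    uintptr_t misalign = (uintptr_t)s & 15;
    const char *p = s - misalign;                /* round down: 16-aligned */

    /* First, possibly misaligned chunk: mask off bytes before the string. */
    __m128i chunk = _mm_load_si128((const __m128i *)p);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
    mask >>= misalign;                           /* drop the head bytes */
    if (mask)
        return (size_t)__builtin_ctz(mask);     /* zero byte in chunk 1 */

    /* Main loop: one aligned 16-byte fetch per iteration. */
    for (;;) {
        p += 16;
        chunk = _mm_load_si128((const __m128i *)p);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
    }
}
```

The 32- and 64-byte variants extend exactly this pattern with 2 or 4 such compares per iteration, which is where the head-alignment complications described above come from.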
Just look at the timings to understand that. Remember, the test data below is from a machine with 256 KB of 2nd level data cache.
The actual output was one continuous list, but I'll split it up, with a comment below each block, to point out what it shows.
Test with string sizes below the 2nd level data cache size (referring to the 256 KB machine)
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 1 bytes buffer near page end -------
25 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
31 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
25 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 2 bytes buffer near page end -------
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 4 bytes buffer near page end -------
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
31 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 8 bytes buffer near page end -------
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
24 cycles for AxStrLenSSE
31 cycles for AxStrLenSSE1
25 cycles for strlen
Well, with string sizes smaller than 16 the "fastest" algo is the first, one-XMM one. If the string is 16+ chars, then with the terminator it is 17+ bytes, so it no longer fits into the one XMM reg and requires an additional iteration of execution; from there on the algo becomes "slower".
The second, complex algo has a higher overhead from its complex entry than the third (nidud's) algo, which is why it's a little bit slower on short strings.
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 16 bytes buffer near page end -------
33 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
31 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 24 bytes buffer near page end -------
33 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
25 cycles for strlen
32 cycles for AxStrLenSSE
30 cycles for AxStrLenSSE1
26 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 32 bytes buffer near page end -------
39 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
39 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
39 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
40 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
40 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 40 bytes buffer near page end -------
39 cycles for AxStrLenSSE
44 cycles for AxStrLenSSE1
39 cycles for strlen
39 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
40 cycles for strlen
39 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
39 cycles for AxStrLenSSE
42 cycles for AxStrLenSSE1
39 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 48 bytes buffer near page end -------
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
------- timings for 56 bytes buffer near page end -------
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
39 cycles for strlen
48 cycles for AxStrLenSSE
43 cycles for AxStrLenSSE1
40 cycles for strlen
The first algo shows a noticeable granularity of 16 bytes: each 16-byte step of the string length "triggers" it to the next, higher timing level.
The second algo continues to behave more or less the same until its fetch length is exceeded by at least 64 data bytes + the zero terminator.
The third algo shows a noticeable granularity of 32 bytes: each 32 bytes of data steps it to higher timings.
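That granularity follows directly from the fetch width: ignoring the entry/head handling, a W-byte fetch needs one more vector iteration for each extra W bytes of string, i.e. roughly ceil((chars + 1) / W) iterations. A tiny hedged illustration (the function name is mine, and this deliberately ignores the misalignment corrections described above):

```c
#include <stdio.h>

/* Vector iterations for an n-char string (plus zero terminator) at
   fetch width w: the timing steps up once per w bytes of length.  */
static unsigned iters(unsigned n, unsigned w)
{
    return (n + 1 + w - 1) / w;   /* ceil((n + 1) / w) */
}

int main(void)
{
    /* The steps land where the timing blocks above step up: the
       16-byte algo at lengths 16, 32, 48...; the 32-byte one at
       32, 64...; the 64-byte one at 64, 128...                   */
    for (unsigned n = 8; n <= 64; n += 8)
        printf("len %2u: w=16 -> %u iters, w=32 -> %u, w=64 -> %u\n",
               n, iters(n, 16), iters(n, 32), iters(n, 64));
    return 0;
}
```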