The MASM Forum

General => The Laboratory => Topic started by: Antariy on March 24, 2015, 06:42:55 AM

Title: Research on the algos design
Post by: Antariy on March 24, 2015, 06:42:55 AM
This post demonstrates and proves some points about algo design.

The attached archive contains the two programs, their sources, and the BAT file which needs to be run to do the actual test - there is no need to run the EXE files themselves. The programs require SSE2, so at least a PIV-class machine is needed to run them.
Test results are very much appreciated. If you decide to run the test, please do it according to the following scheme:

Run StrLenTest.bat from the command line with its output redirected to a text file. Then copy the content of the file to the forum. If the content is too long to fit into one message, split it into two posts. Also, if you know it, mention the 2nd level data cache size on your machine (the cache per CPU core, not the total of all the cores' caches; if the CPU has a 3rd level cache, mention its size in addition).
The full test requires that your machine be able to allocate 64 MB, so it's a good idea to run it on a machine with 256+ MB of RAM. The partial test, though, requires only a small amount of memory (up to 100 KB), which any machine today can provide.

The testbed is an old one, used in 2010 to test StrLen algos, adapted to the current task.

This post will be split into two because of the forum's message size limit (is it possible to increase it, Hutch?).

What this research proves.

We all know and like to write algos and to test them in The Laboratory. The problem is the method used for that. Currently we "test" algos against short pieces of data (short relative to the 2nd level data cache - read below for an explanation), and the test runs many times over the same short data. Executed under these conditions, some types of algos do not show their real-world performance.

After the first iteration of the test (or at least after a couple more) - a call to the tested code and the return from it - the short data is already in the cache, so the subsequent iterations run under conditions that are unfair (read below why) with respect to the code's real-world performance. Only the first iteration runs at real-world speed; the following many thousands of iterations run under "static", not real-world, conditions, so in the final averaged results the timing of that first run has no influence, and the real performance stays hidden.

Well, suppose some code is the "fastest" under these conditions; for some types of algos this means little for real performance. One of those types is the "StrLen" kind of algo.

Why is it so?

Because, in the real world, the application runs in a multitasking environment where the machine's resources are heavily shared between applications, so even the 2nd level data cache cannot be considered a resource a program has exclusive use of.

But the point of bigger value is that in a real-world system the code almost always executes on data that is not in the cache. So it may be taken as an "axiom" that code which walks over all the passed data in a single pass should not be tested in a manner where the cache's "help" makes the code look too fast. StrLen algos are such code - they make one pass through all the data given to them. Called in the real world, where it is hard to find an app which loops over a StrLen algo computing the size of the same string millions of times, this code runs on data that is not in the cache, and there is not much point in "crazy optimisations": all of them ultimately hit the bottleneck of the memory (RAM) subsystem speed and, moreover, the bottleneck of the memory paging subsystem.

If the data is not in the cache, it will be loaded from RAM, and that will always be much slower than an over-sophisticated piece of code can process it.

If the data is not even in RAM but in the paging file (a frequent case on a multitasking system with finite resources), then the OS memory subsystem's handling of the page fault and loading of the page from the paging file into RAM (only then can the code process it) costs many, many thousands of cycles, so even incredibly sophisticated code has to wait for that, with no respect paid to its sophistication.

This is the real world for algos of the type described in the axiom above.

The proof.

Well, here is some pretty sophisticated code which proves the point above.

The test compares the following algos:

AxStrLenSSE - the simple code using one XMM reg (and, BTW, preserving it).

strlen - nidud's code, which uses a 32-byte fetch with 2 XMM regs.

AxStrLenSSE1 - the complex code, which can probably be considered "state of the art" of its kind; it uses 4 XMM regs, i.e. works on 64-byte chunks. The task is not trivial, because the simplest way of doing this without over-complicating the code is a 32-byte chunk. With a 64-byte chunk the code gets complicated because there are two possible 32-byte misalignments, which are handled in the first part of the code. Each case uses a "simple" 32-byte chunk to process the misaligned head, but the second case additionally has to be corrected, whether it was jumped to directly or entered after the first misalignment correction.

After the correction the code goes on to the 64-byte data fetch, which in its iterations processes the rest of the string data.

The complex entry code, which handles non-64-byte alignment via actual 32-byte chunks, slows the algo down, especially for string sizes bigger than 64 bytes (that's 64 actual string bytes plus the zero terminator, for instance) but smaller than the amount of data needed for the 64-byte fetch to show its speed-up; in that range the complex entry has a noticeable influence. With strings below 64 bytes (63 chars + terminator) it is not slowed down much, because there are shortcuts with fast returns when the string is smaller than 32 and/or 64 bytes. It would actually be possible to add a shortcut for the bottleneck range between 64 and 256 bytes of string data, making this algo a "universal winner", but that was not the point - it would, again, make the code even more complex and bigger, for no real-world performance gain.
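To make the basic technique concrete, here is a sketch in C with SSE2 intrinsics of the one-XMM idea behind the first algo (a simplified illustration of the idea only, not the actual MASM code from the attached archive): align the pointer down to 16 bytes, compare a whole XMM register against zero, and mask off the match bits for bytes that lie before the string start.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* One-XMM strlen sketch. The aligned 16-byte load never crosses a page
   boundary, so reading a few bytes before the string start is safe. */
size_t strlen_one_xmm(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    uintptr_t offset = (uintptr_t)s & 15;
    const char *p = s - offset;                      /* align down to 16 */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    mask &= 0xFFFFu << offset;      /* ignore bytes before the string   */
    while (mask == 0) {             /* one 16-byte chunk per iteration  */
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }
    return (size_t)((p + __builtin_ctz(mask)) - s);  /* GCC/Clang ctz   */
}
```

The second and third algos extend the same loop body to two and four XMM registers (32- and 64-byte chunks), at the price of the heavier misalignment entry described above.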

Just look at the timings to understand that. Remember, the test data below is from a machine with 256 KB of 2nd level data cache.
The actual output was one continuous list, but I'll split it up, with comments below each block explaining what it shows.

Code:
Test with string sizes below the 2nd level data cache size (referred to the 256 KB cache machine)
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 1 bytes buffer near page end -------
25      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
31      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

25      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 2 bytes buffer near page end -------
24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 4 bytes buffer near page end -------
24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
31      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 8 bytes buffer near page end -------
24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

24      cycles for AxStrLenSSE
31      cycles for AxStrLenSSE1
25      cycles for strlen

Well, with string sizes smaller than 16 the "fastest" algo is the first one, using one XMM reg. If the string is 16+ chars, then with the terminator it is 17+ bytes, so it no longer fits into one XMM reg and requires additional execution and an extra iteration; beyond that point the algo becomes "slower".

The second, complex algo has a higher complex-entry overhead than the third (nidud's) algo, which is why it is a little bit slower on short strings.
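The 16-byte granularity is easy to model: a scanner that compares 16 bytes per step needs ceil((len + 1) / 16) chunk compares, terminator included. A toy model (ignoring alignment and entry overhead):

```c
/* Chunk compares a 16-bytes-per-step scanner needs for a string of
   len characters (len + 1 bytes with the terminator) - a toy model
   ignoring alignment and entry overhead. */
unsigned chunks16(unsigned len)
{
    return (len + 1 + 15) / 16;    /* ceil((len + 1) / 16) */
}
```

A 15-char string needs one compare, while a 16-char string already needs two - which matches the timing step of the first algo from ~25 to ~32 cycles between the 8- and 16-byte rows above.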

Code:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 16 bytes buffer near page end -------
33      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

32      cycles for AxStrLenSSE
31      cycles for AxStrLenSSE1
25      cycles for strlen

32      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

32      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 24 bytes buffer near page end -------
33      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

32      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

32      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
25      cycles for strlen

32      cycles for AxStrLenSSE
30      cycles for AxStrLenSSE1
26      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 32 bytes buffer near page end -------
39      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

39      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

39      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

40      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
40      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 40 bytes buffer near page end -------
39      cycles for AxStrLenSSE
44      cycles for AxStrLenSSE1
39      cycles for strlen

39      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
40      cycles for strlen

39      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

39      cycles for AxStrLenSSE
42      cycles for AxStrLenSSE1
39      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 48 bytes buffer near page end -------
48      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

48      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

48      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

48      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 56 bytes buffer near page end -------
48      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

48      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

48      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
39      cycles for strlen

48      cycles for AxStrLenSSE
43      cycles for AxStrLenSSE1
40      cycles for strlen


The first algo shows a noticeable 16-byte granularity: each 16-byte step of the string length "triggers" it to the next, higher timing level.
The second algo continues to behave more or less the same until the string exceeds its fetch length, i.e. at least 64 data bytes + the zero terminator.
The third algo shows a noticeable 32-byte granularity before stepping to higher timings.

Title: Re: Research on the algos design
Post by: Antariy on March 24, 2015, 06:46:21 AM
Code:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 64 bytes buffer near page end -------
50      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
47      cycles for strlen

50      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

50      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
47      cycles for strlen

50      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 72 bytes buffer near page end -------
50      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

50      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

49      cycles for AxStrLenSSE
63      cycles for AxStrLenSSE1
46      cycles for strlen

50      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 80 bytes buffer near page end -------
55      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

55      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

55      cycles for AxStrLenSSE
77      cycles for AxStrLenSSE1
46      cycles for strlen

56      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 88 bytes buffer near page end -------
55      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

55      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

57      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

55      cycles for AxStrLenSSE
62      cycles for AxStrLenSSE1
46      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 96 bytes buffer near page end -------
63      cycles for AxStrLenSSE
74      cycles for AxStrLenSSE1
54      cycles for strlen

63      cycles for AxStrLenSSE
74      cycles for AxStrLenSSE1
54      cycles for strlen

64      cycles for AxStrLenSSE
73      cycles for AxStrLenSSE1
54      cycles for strlen

64      cycles for AxStrLenSSE
83      cycles for AxStrLenSSE1
54      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 100 bytes buffer near page end -------
63      cycles for AxStrLenSSE
74      cycles for AxStrLenSSE1
54      cycles for strlen

63      cycles for AxStrLenSSE
76      cycles for AxStrLenSSE1
54      cycles for strlen

64      cycles for AxStrLenSSE
73      cycles for AxStrLenSSE1
55      cycles for strlen

64      cycles for AxStrLenSSE
73      cycles for AxStrLenSSE1
55      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 128 bytes buffer near page end -------
87      cycles for AxStrLenSSE
76      cycles for AxStrLenSSE1
63      cycles for strlen

77      cycles for AxStrLenSSE
76      cycles for AxStrLenSSE1
63      cycles for strlen

77      cycles for AxStrLenSSE
75      cycles for AxStrLenSSE1
63      cycles for strlen

79      cycles for AxStrLenSSE
75      cycles for AxStrLenSSE1
63      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 192 bytes buffer near page end -------
105     cycles for AxStrLenSSE
93      cycles for AxStrLenSSE1
84      cycles for strlen

105     cycles for AxStrLenSSE
94      cycles for AxStrLenSSE1
84      cycles for strlen

105     cycles for AxStrLenSSE
94      cycles for AxStrLenSSE1
84      cycles for strlen

105     cycles for AxStrLenSSE
94      cycles for AxStrLenSSE1
85      cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 256 bytes buffer near page end -------
129     cycles for AxStrLenSSE
109     cycles for AxStrLenSSE1
103     cycles for strlen

129     cycles for AxStrLenSSE
110     cycles for AxStrLenSSE1
103     cycles for strlen

129     cycles for AxStrLenSSE
109     cycles for AxStrLenSSE1
102     cycles for strlen

128     cycles for AxStrLenSSE
110     cycles for AxStrLenSSE1
103     cycles for strlen

The second algo shows the "cost" of its complex entry plus the large data fetch, which cannot "pay itself back" on too-short strings. After crossing 64 bytes of string data the algo is noticeably slower than the other two. It overtakes the first algo only after crossing 128 bytes of string data, and catches the third algo after crossing the 256-byte boundary. After that it keeps pulling further ahead of the other algos as long as the data size (string length) is small enough to fit into the cache.

The other two algos behave the same as above - they continue to show 16- and 32-byte granularity for the first and third algo respectively, since they do not have to "pay back" the high "cost" of a complex entry.

Code:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 384 bytes buffer near page end -------
265     cycles for AxStrLenSSE
144     cycles for AxStrLenSSE1
143     cycles for strlen

264     cycles for AxStrLenSSE
142     cycles for AxStrLenSSE1
143     cycles for strlen

265     cycles for AxStrLenSSE
142     cycles for AxStrLenSSE1
143     cycles for strlen

264     cycles for AxStrLenSSE
143     cycles for AxStrLenSSE1
143     cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 512 bytes buffer near page end -------
318     cycles for AxStrLenSSE
181     cycles for AxStrLenSSE1
182     cycles for strlen

317     cycles for AxStrLenSSE
175     cycles for AxStrLenSSE1
182     cycles for strlen

317     cycles for AxStrLenSSE
175     cycles for AxStrLenSSE1
182     cycles for strlen

317     cycles for AxStrLenSSE
177     cycles for AxStrLenSSE1
181     cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 1024 bytes buffer near page end -------
525     cycles for AxStrLenSSE
305     cycles for AxStrLenSSE1
417     cycles for strlen

525     cycles for AxStrLenSSE
305     cycles for AxStrLenSSE1
416     cycles for strlen

533     cycles for AxStrLenSSE
305     cycles for AxStrLenSSE1
418     cycles for strlen

523     cycles for AxStrLenSSE
305     cycles for AxStrLenSSE1
417     cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 10240 bytes buffer near page end -------
4274    cycles for AxStrLenSSE
2716    cycles for AxStrLenSSE1
3192    cycles for strlen

4285    cycles for AxStrLenSSE
2742    cycles for AxStrLenSSE1
3194    cycles for strlen

4263    cycles for AxStrLenSSE
2733    cycles for AxStrLenSSE1
3194    cycles for strlen

4265    cycles for AxStrLenSSE
2714    cycles for AxStrLenSSE1
3205    cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 102400 bytes buffer near page end -------
56469   cycles for AxStrLenSSE
29757   cycles for AxStrLenSSE1
44213   cycles for strlen

56588   cycles for AxStrLenSSE
29776   cycles for AxStrLenSSE1
44368   cycles for strlen

56429   cycles for AxStrLenSSE
29744   cycles for AxStrLenSSE1
44209   cycles for strlen

56706   cycles for AxStrLenSSE
29793   cycles for AxStrLenSSE1
44201   cycles for strlen

Having drawn equal with the third algo at the 256-byte boundary, the second algo keeps outpacing the other algos: with a 1024-byte string it is faster than the third algo by ~25%; with a 10240-byte string it's faster by ~15% (probably here we see some interference from this machine's not-very-big cache - some underlying hardware effect prevents the code from showing more than a 25% advantage, though in the next test, 102400 bytes, it does); with a 102400-byte string it's faster by ~33%.

So here we see that the second algo has paid back its complexity cost and started to outpace the other algos very noticeably.
But this continues only while the data fits into the 2nd level data cache. Beyond that, the sophistication of the code has little value at all - every piece of tested code shows that it is really limited by the throughput of the memory bus subsystem.

The tests proving that are below.

Code:
Test with string sizes above the 2nd level data cache size (referred to the 256 KB cache machine)
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 1048576 bytes buffer near page end -------
1072862 cycles for AxStrLenSSE
1009492 cycles for AxStrLenSSE1
1019562 cycles for strlen

1065923 cycles for AxStrLenSSE
1018654 cycles for AxStrLenSSE1
1022758 cycles for strlen

1066194 cycles for AxStrLenSSE
1077587 cycles for AxStrLenSSE1
1067003 cycles for strlen

1056830 cycles for AxStrLenSSE
1012637 cycles for AxStrLenSSE1
1023902 cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 2097152 bytes buffer near page end -------
2121654 cycles for AxStrLenSSE
2034694 cycles for AxStrLenSSE1
2054623 cycles for strlen

2115131 cycles for AxStrLenSSE
2022984 cycles for AxStrLenSSE1
2052423 cycles for strlen

2103862 cycles for AxStrLenSSE
2029368 cycles for AxStrLenSSE1
2055135 cycles for strlen

2121802 cycles for AxStrLenSSE
2033895 cycles for AxStrLenSSE1
2057796 cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 4194304 bytes buffer near page end -------
4233774 cycles for AxStrLenSSE
4062312 cycles for AxStrLenSSE1
4118572 cycles for strlen

4232783 cycles for AxStrLenSSE
4094175 cycles for AxStrLenSSE1
4126694 cycles for strlen

4232998 cycles for AxStrLenSSE
4064766 cycles for AxStrLenSSE1
4118622 cycles for strlen

4216718 cycles for AxStrLenSSE
4103574 cycles for AxStrLenSSE1
4114675 cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 8388608 bytes buffer near page end -------
8651170 cycles for AxStrLenSSE
8113898 cycles for AxStrLenSSE1
8221349 cycles for strlen

8445985 cycles for AxStrLenSSE
8124165 cycles for AxStrLenSSE1
8210923 cycles for strlen

8444750 cycles for AxStrLenSSE
8386790 cycles for AxStrLenSSE1
8229474 cycles for strlen

8441142 cycles for AxStrLenSSE
8126330 cycles for AxStrLenSSE1
8239519 cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 16777216 bytes buffer near page end -------
16873029        cycles for AxStrLenSSE
16244376        cycles for AxStrLenSSE1
16431242        cycles for strlen

16862886        cycles for AxStrLenSSE
16297743        cycles for AxStrLenSSE1
16425690        cycles for strlen

17146142        cycles for AxStrLenSSE
16213937        cycles for AxStrLenSSE1
16427854        cycles for strlen

16937842        cycles for AxStrLenSSE
16326234        cycles for AxStrLenSSE1
16412266        cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 33554432 bytes buffer near page end -------
33769089        cycles for AxStrLenSSE
32569609        cycles for AxStrLenSSE1
32847997        cycles for strlen

33761229        cycles for AxStrLenSSE
32440494        cycles for AxStrLenSSE1
32934703        cycles for strlen

33767590        cycles for AxStrLenSSE
32429154        cycles for AxStrLenSSE1
32846578        cycles for strlen

33867550        cycles for AxStrLenSSE
32698092        cycles for AxStrLenSSE1
32856044        cycles for strlen

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 67108864 bytes buffer near page end -------
67551116        cycles for AxStrLenSSE
64992316        cycles for AxStrLenSSE1
65833018        cycles for strlen

67516614        cycles for AxStrLenSSE
65174418        cycles for AxStrLenSSE1
65777091        cycles for strlen

68162038        cycles for AxStrLenSSE
65261270        cycles for AxStrLenSSE1
65672814        cycles for strlen

67654958        cycles for AxStrLenSSE
64951990        cycles for AxStrLenSSE1
65857864        cycles for strlen

As you see, the algos go almost "head to head" from 1 MB of data up to 64 megs. Do not forget that this data is from a machine with a 256 KB cache, so on your machine you will get different results, depending on your cache size.

Here we see that "even" the simple first algo comes very close to the other algos - it is just ~4 percent behind the fastest algo here, the second one.

Actually, from this we can calculate the throughput of the memory bus: 2133333333/(64951990/67108864) = 2204175368 bytes per second, which is 2.05 GB per second. The theoretical memory speed on this machine is a bit above 3 GB per second (PC3200), so, taking into account that the code runs in a multitasking environment, this speed is pretty high and can be considered close to the real-world maximum memory bus throughput for an application on a multitasking OS. Also remember that the memory was not actually locked - i.e. the data was not committed into RAM and locked there - so paging influenced the timings too; without it the throughput would have been higher. But, again, this was a real-world test, not a "static" one: there is almost no chance that anyone will write a program which allocates its strings and locks them in RAM, and just as little chance that, having done so, they would also run the strlen algo over those strings millions of times in a row for every string.
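As a sanity check, the same arithmetic in C, with the numbers taken from the best AxStrLenSSE1 run of the 67108864-byte test above (the clock is assumed to be exactly 2133333333 Hz):

```c
#include <stdint.h>

/* Bus throughput = bytes * frequency / cycles, i.e. bytes divided by
   the elapsed time cycles/Hz. The product fits in 64 bits here. */
int64_t throughput_bps(int64_t bytes, int64_t cycles, int64_t hz)
{
    return bytes * hz / cycles;
}
```

throughput_bps(67108864, 64951990, 2133333333) gives 2204175368 bytes per second, i.e. about 2.05 GB/s, matching the figure above.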

Conclusion

1) In the real world the data passed to an algo is almost always not in the cache, especially data which is not touched very frequently (many times per second).

2) The data may not even be in RAM but in the paging file, which makes access to it even slower.

3) Algos which make no more than one pass over all the data passed to them should, given the two points above, not be over-complicated, since the complexity yields no gain in the real world. An algo should get as close as possible to the hardware's bottleneck throughput - beyond that, any further "synthetic" optimisation just leads to over-complex code and code size growth with no performance gain.

This research shows that the memory bandwidth is the bottleneck, and that there is no sense in running thousands of iterations over short data which fits into the cache just to "proudly report" "the fastest algo".

Since the memory bus is the limiter, and because of the three points above, the code will not be faster than the memory bus allows it to be.

The regular szLen from the MASM32 lib was twice as slow as the other algos in the test with big data sizes (bigger than the cache size) - and those are the real-world comparisons. Although the second algo was almost 10 times faster than szLen on small data fitting into the cache, that "speed" has no meaning given the points, research and tests above.
So the simplest solution here - one that works faster than a regular GPR byte scan and almost as fast (slower by a couple of percent on average) as the more complex and bigger algos tested here - is the first algo. And that is "enough algo" for an algo of this kind.

How to test algos which fall under point 3) above (the "axiom").

Just test your algos on data which is bigger than the cache size - say, 10 times bigger - and you can be sure that if your algo is faster than another while being bigger and more complex, or as fast as the other while being shorter and simpler, it is the better algo.

Yes, the real-world performance of your algos on a real-world machine (the 3 points above, and this whole post, describe that), even with small data passed to them, will be the same as when you pass them data much bigger than the cache and run the processing only a few times, taking an average and discarding the paging slowdown (dividing that slowdown between the test loops). But their performance will not look as "high" as it does when you test them millions of times on the same short data.
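That scheme can be sketched in C like this (the 10 x 256 KB buffer size is an assumption for this machine's cache - scale it to roughly 10x your own L2 - and the library strlen stands in for whatever algo you want to measure):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time one pass of strlen over a freshly allocated buffer much bigger
   than the L2 cache; returns the measured length so the result can be
   verified. Run it only a few times and average - no million-rep loop. */
size_t bench_strlen_once(size_t size)
{
    char *buf = malloc(size);
    if (!buf) return (size_t)-1;
    memset(buf, 'A', size - 1);        /* non-zero payload             */
    buf[size - 1] = '\0';              /* terminator at the very end   */
    clock_t t0 = clock();
    size_t len = strlen(buf);          /* the algo under test          */
    clock_t t1 = clock();
    printf("%zu bytes: %ld clock ticks\n", size, (long)(t1 - t0));
    free(buf);
    return len;
}
```

Because the buffer is allocated fresh each run, the first touch of every page comes from RAM (or the paging file), not from the cache - which is exactly the real-world condition described above.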
Title: Re: Research on the algos design
Post by: nidud on March 24, 2015, 08:19:05 AM
first:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
212 code size for AxStrLenSSE1 212 total bytes for AxStrLenSSE1
104 code size for strlen 104 total bytes for strlen
8 cycles for AxStrLenSSE
15 cycles for AxStrLenSSE1
12 cycles for strlen

9 cycles for AxStrLenSSE
15 cycles for AxStrLenSSE1
12 cycles for strlen

11 cycles for AxStrLenSSE
15 cycles for AxStrLenSSE1
12 cycles for strlen

19 cycles for AxStrLenSSE
16 cycles for AxStrLenSSE1
12 cycles for strlen

last:
Code: [Select]
------- timings for 33554432 bytes buffer near page end -------
38829510 cycles for AxStrLenSSE
34936987 cycles for AxStrLenSSE1
34119080 cycles for strlen

38787761 cycles for AxStrLenSSE
33859216 cycles for AxStrLenSSE1
34673664 cycles for strlen

39382536 cycles for AxStrLenSSE
33650156 cycles for AxStrLenSSE1
34856036 cycles for strlen

36651461 cycles for AxStrLenSSE
32640715 cycles for AxStrLenSSE1
32766959 cycles for strlen

Title: Re: Research on the algos design
Post by: jj2007 on March 24, 2015, 08:35:56 AM
Hi Alex,
Interesting stuff. Here are i5 results for you :t
Title: Re: Research on the algos design
Post by: Antariy on March 24, 2015, 08:46:58 AM
Thank you, nidud! :t

The testbed is a bit sloppy but shows the idea pretty well.
Was it unable to allocate the 64 MB buffer? Your data only goes up to 32 MB. Or did you just stop the test?

Does your CPU have 2 or 4 MB of 2nd level cache? The timings with the 1 MB buffer are already close but still show that the buffer fits into the cache, so this is just a guess that the cache is 2 or 4 MB, since those tests already show the code running very close regardless of the algo.
Title: Re: Research on the algos design
Post by: Antariy on March 24, 2015, 08:59:50 AM
Thank you, Jochen! :t

It's interesting to see that on your CPU the short code runs the same as the 32-byte-fetching code up to 64-byte strings, and again after crossing 1 MB string size.

BTW, does your CPU have 1 or 2 MB of cache? Maybe 1 MB - the timings are still unstable but already getting closer - but with 2 MB they run close, so probably 2 MB of 2nd level data cache per core?
Title: Re: Research on the algos design
Post by: jj2007 on March 24, 2015, 10:22:39 AM
BTW, does your CPU have 1 or 2 MB of cache?

i5? 2MB, I suppose, but I'm not sure.
Title: Re: Research on the algos design
Post by: hutch-- on March 24, 2015, 12:38:17 PM
I have some funny old fashioned views on benchmarking algorithms. If you want to purely test mnemonic combinations, bashing away at looped data gives you some idea of the difference, but if you want to test algorithms in a real world context to see if there is a speed difference, you start to need very large examples so that you effectively break the data cache. If you want to test the linear speed of an algorithm, do it on a gigabyte sample so that cache thrashing is not a factor.

If, alternately, you want to time the attack of an algorithm (how fast it gets going), you make another massive array of short data items and loop through it while timing the results. The closer a benchmark is to the conditions the algorithm will be used in, the better the idea it will give you of how fast it is. Short looped timings have the problem that the first few iterations are slower than the rest once the data is in cache, and to this extent Alex is correct in that looped timings rarely ever give you a real world result. The numbers may look cute but they have little resemblance to how well the algorithms perform.
Title: Re: Research on the algos design
Post by: jj2007 on March 24, 2015, 01:38:30 PM
Simple test with a 4.3 MB file:

include \masm32\MasmBasic\MasmBasic.inc      ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  Let esi="Bible.txt"
  xor ebx, ebx
  .if !Exist(esi)
      Let esi="http://files.allbasic.info/ScriptBasic/Bible.txt"
      inc ebx
  .endif
  Recall esi, t$()
  push eax            ; string counter
  NanoTimer()
  .if ebx
      Rename "~FrLocal.tmp", "Bible.txt"
      PrintLine "Downloaded Bible.txt from ", esi
  .endif
  xor ebx, ebx
  .Repeat
      void Len(t$(ebx))
      inc ebx
  .Until ebx>=stack
  Print Str$("Len(): %i µs", NanoTimer(µs)), Str$(" for %i lines\n", ebx)

  NanoTimer()
  xor ebx, ebx
  .Repeat
      void len(t$(ebx))
      inc ebx
  .Until ebx>=stack
  Print Str$("len(): %i µs", NanoTimer(µs)), Str$(" for %i lines\n", ebx)
  Exit
end start


Code: [Select]
Len(): 1277 µs for 33979 lines
len(): 2499 µs for 33979 lines
Title: Re: Research on the algos design
Post by: rrr314159 on March 24, 2015, 01:47:01 PM
I agree with all of the above - including the half I didn't read
Title: Re: Research on the algos design
Post by: Antariy on March 24, 2015, 08:45:58 PM
Simple test with a 4.3 MB file:

include \masm32\MasmBasic\MasmBasic.inc      ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  Let esi="Bible.txt"
  xor ebx, ebx
  .if !Exist(esi)
      Let esi="http://files.allbasic.info/ScriptBasic/Bible.txt"
      inc ebx
  .endif
  Recall esi, t$()
  push eax            ; string counter
  NanoTimer()
  .if ebx
      Rename "~FrLocal.tmp", "Bible.txt"
      PrintLine "Downloaded Bible.txt from ", esi
  .endif
  xor ebx, ebx
  .Repeat
      void Len(t$(ebx))
      inc ebx
  .Until ebx>=stack
  Print Str$("Len(): %i µs", NanoTimer(µs)), Str$(" for %i lines\n", ebx)

  NanoTimer()
  xor ebx, ebx
  .Repeat
      void len(t$(ebx))
      inc ebx
  .Until ebx>=stack
  Print Str$("len(): %i µs", NanoTimer(µs)), Str$(" for %i lines\n", ebx)
  Exit
end start


Code: [Select]
Len(): 1277 µs for 33979 lines
len(): 2499 µs for 33979 lines

len is the default MASM32 library procedure?
As you see, on your machine, with data which doesn't fit "neatly" into the cache, the GPR code is slower than the SSE code by a factor of 2 - the same as I mentioned in the big post; it may have some bias on different machines, though, that was not stated as a "constant".
Title: Re: Research on the algos design
Post by: Antariy on March 24, 2015, 08:58:04 PM
I have some funny old fashioned views on benchmarking algorithms. If you want to purely test mnemonic combinations, bashing away at looped data gives you some idea of the difference, but if you want to test algorithms in a real world context to see if there is a speed difference, you start to need very large examples so that you effectively break the data cache. If you want to test the linear speed of an algorithm, do it on a gigabyte sample so that cache thrashing is not a factor.

If, alternately, you want to time the attack of an algorithm (how fast it gets going), you make another massive array of short data items and loop through it while timing the results. The closer a benchmark is to the conditions the algorithm will be used in, the better the idea it will give you of how fast it is. Short looped timings have the problem that the first few iterations are slower than the rest once the data is in cache, and to this extent Alex is correct in that looped timings rarely ever give you a real world result. The numbers may look cute but they have little resemblance to how well the algorithms perform.

Yes, that's what I was saying, too :t Algos which fall under point 3) (second part of the bigger post at the start of the thread) should be benchmarked with a huge mass of data which floods the cache and clears away its "help" - an algo of this kind in the real world will not touch the same data more than once (StrLen is such an algo: it walks through the data bytes just once).

Though the algos which do number crunching on some limited data fitting into the cache, and do that over many iterations, may be tested with the simpler methods like the ones we use in the lab - i.e. timing them over many iterations with not-too-big data passed to them.

That was the point, and that's why I underlined that this is research on the design of algos - those which fall under the point 3) conditions.
Title: Re: Research on the algos design
Post by: Antariy on March 24, 2015, 08:59:27 PM
I agree with all of the above - including the half I didn't read

:biggrin:

But the other question: Hutch, rrr, where are the timings? :biggrin: Though I'm sure they will prove the main post of the thread, the timings for short data strings are interesting, too.
Title: Re: Research on the algos design
Post by: Gunther on March 24, 2015, 10:18:01 PM
Hi Alex,

here are your timings. It's an old machine at the university. Should I test it on my machine at home for you?
First:
Code: [Select]
AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 10240 bytes buffer near page end -------
2616    cycles for AxStrLenSSE
1508    cycles for AxStrLenSSE1
1980    cycles for strlen

2625    cycles for AxStrLenSSE
1509    cycles for AxStrLenSSE1
1959    cycles for strlen

2616    cycles for AxStrLenSSE
1508    cycles for AxStrLenSSE1
1959    cycles for strlen

2624    cycles for AxStrLenSSE
1507    cycles for AxStrLenSSE1
1959    cycles for strlen

Second:
Code: [Select]
AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ (SSE3)
212     code size for AxStrLenSSE1      212     total bytes for AxStrLenSSE1
104     code size for strlen    104     total bytes for strlen
------- timings for 10240 bytes buffer near page end -------
5722    cycles for AxStrLenSSE
2668    cycles for AxStrLenSSE1
3162    cycles for strlen

3992    cycles for AxStrLenSSE
2907    cycles for AxStrLenSSE1
3247    cycles for strlen

3667    cycles for AxStrLenSSE
2881    cycles for AxStrLenSSE1
3173    cycles for strlen

3863    cycles for AxStrLenSSE
2880    cycles for AxStrLenSSE1
3325    cycles for strlen

Gunther
Title: Re: Research on the algos design
Post by: FORTRANS on March 25, 2015, 01:26:57 AM
Hi Alex,

   Two laptops.

HTH,

Steve
Title: Re: Research on the algos design
Post by: nidud on March 25, 2015, 01:47:10 AM
From my experience, testing small algos and then doing a real test in the application they are used in shows consistency in the test results.

I will also add that the result from a real test is surprising. It’s often difficult to see how many times these functions are called in any given sequence of code. A few small adjustments could then have a huge impact where these functions are called millions of times.
Title: Re: Research on the algos design
Post by: Antariy on March 25, 2015, 06:13:10 AM
Gunther, thank you! :t Yes, if it's not too boring, test it at home, too.

Steve, thank you for the test! It seems that both of your tested CPUs have 2 MB of cache.

From my experience with testing small algos and doing a real test in the application they are used in shows consistency in the test results.

I will also add that the result from a real test is surprising. It’s often difficult to see how many times these functions are called at any given sequence of code. A few small adjustments could then have a huge impact where these functions are called millions of times.

You missed the main point. You talk about functions "called millions of times", but where did you see an application which does this with the same string consecutively?
If you want to test such a function which may be called millions of times in the real world - well, OK, just do a testbed like Hutch said: create an array of pointers to short/long (as you want) strings, but the number of strings should be very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real-world performance of the algo. All other "called millions of times" types of test do not show the right result. An algo may be very fast within a testbed, but in reality it will not perform any faster than a much smaller algo; it may even be slower. Yes, that's surprising, just as in the test here you see 3 algos of different design and size run at the same speed when tested in real-world conditions. You may take that or not, but, actually, yes, our tests here and in the Memcopy thread are actually almost useless: we posted a bunch of functions, but in the real world all of them will perform at the same speed.

And, again, each point was described in the big posts - when we are talking about those conditions and about algos which fall under point 3), it's impossible to argue; well, it's possible, but such a dispute is useless, since if the well-explained posts haven't proved anything, then there is nothing to dispute. Everyone has their own opinion and the right to stick with it; this post was not an "eye-opening reveal", that was just an example which may prove something to those who find it useful.
Title: Re: Research on the algos design
Post by: jj2007 on March 25, 2015, 06:50:53 AM
just do a testbed like Hutch said: create an array of pointers to short/long (as you want) strings, but the number of strings should be very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real-world performance of the algo.

Note that the 'bible' testbed in reply #8 does exactly that. Here is a run with 50 bibles, 200 MB, 1519150 different pointers:
Code: [Select]
Len(): 54501 us for 1519150 lines, 200850450 bytes
len(): 120439 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 53723 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54899 us for 1519150 lines, 200850450 bytes

Len(): 53089 us for 1519150 lines, 200850450 bytes
len(): 111344 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 54654 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54684 us for 1519150 lines, 200850450 bytes
Title: Re: Research on the algos design
Post by: nidud on March 25, 2015, 07:23:53 AM
My point was the connection between testing and real usage, and my experience is that the test result is not perfect but valid in this regard.

You missed the main point. You talk about functions "called millions of times", but where did you see an application which does this with the same string consecutively?

All the time where speed matters  :P

The testbed I use passes a new pointer for each call, randomly aligned. But in cases where speed matters these functions work in a loop and thus use the same memory repeatedly. A call here and there to strlen is not that important speed-wise.

Quote
If you want to test such a function which may be called millions of times in the real world - well, OK, just do a testbed like Hutch said: create an array of pointers to short/long (as you want) strings, but the number of strings should be very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real-world performance of the algo.

I do this all the time in a real application to select which algo to use, and I don’t see the cache as a problem in this case.

Quote
our tests here and in the Memcopy thread are actually almost useless
...
that was just an example which may prove something to those who find it useful.

I do find the rant useful but somewhat depressing given my experience is not the same.
Title: Re: Research on the algos design
Post by: Antariy on March 25, 2015, 08:24:01 AM
My point was the connection between testing and real usage, and my experience is that the test result is not perfect but valid in this regard.

It's "valid", but not always - of limited validity, let's say. Yes, you may say "if an algo which is tested in the usual way is faster than other algos, then almost always (key word: "almost") it will be faster than the other algos in the real world".
You are referring to experience - but your experience, even if unlimited, will not disprove some simple, basic things, like the memory bus speed, like the caching subsystem (which works well only with data which is accessed constantly and thus sits in the cache at least "almost" constantly), and like the memory subsystem of the OS, which slows things down even more. OK, well, speaking of referring to your experience: just a couple of weeks ago you were fully sure that there was no possibility to write a safe algo and that the "deal" of speed versus robustness was acceptable. "Experience" is just a word, and it will not disprove basic things like caching/memory bus speed limits/OS paging. We may all refer to our own exclusive and limited (everyone's experience is limited) experience, but that isn't an argument in a dispute; it's just a "side note" like "from your experience something was so".

You missed the main point. You talk about functions "called millions of times", but where did you see an application which does this with the same string consecutively?

All the time where speed matters  :P

Sarcasm and the "tongue" smiley are not an argument. You are unable to disprove what was said here and you will never be able to disprove it - take that or not; everyone has the right to be a fanatic.

Try to find an example of an algo which calls strlen millions of times for the same string - consecutively. Why is the word underlined? Because if you call it only from time to time, the string will be flushed from the cache within milliseconds of the previous call, so the next call to the code will again go the slow - memory bus speed - way.

And, for your info, no modern and widely used library which works with strings massively actually uses zero-terminated strings. Every more or less advanced library uses counted (binary) strings which store their size in the string object, so the length of the string is computed only once - when it's converted from a zero-terminated string to the string object used by the runtime. All other operations then require no further calls to a zero-terminated strlen. Zero-terminated strings are something of an anachronism - they are still used just for their simplicity and standardness. Any other string storage mechanism requires additional info about the string object container and the data layout - for instance, how the length of the string is stored. Before the string? In some field of an object structure? At which offset? And what is the byte order of the variable which holds the length? Why do you think zero-terminated strings are still used? Because the "programming world" is still "more C-like", and that simplicity of the string format came from the "portability" of C, too. The simpler the format, the more portable it is... regardless of the fact that it's silly to call code every time you want to get a string's length.
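The counted-string idea above can be sketched in a few lines of C. This `BStr` container is hypothetical (not any particular library's type): the length is paid for once, at conversion time, and every later length query is O(1):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted-string container: the length is stored alongside
 * the bytes, so it never has to be re-scanned. */
typedef struct {
    size_t len;
    char  *data;
} BStr;

/* Convert a zero-terminated string once; the only strlen call, ever. */
BStr bstr_from_cstr(const char *s)
{
    BStr b;
    b.len  = strlen(s);
    b.data = malloc(b.len + 1);
    if (b.data)
        memcpy(b.data, s, b.len + 1);   /* copy bytes plus the NUL */
    return b;
}

/* O(1): no re-scan of the bytes. */
size_t bstr_len(const BStr *b)
{
    return b->len;
}
```

Every subsequent operation (concatenation, search, copy) can then work on `len` directly instead of re-walking the bytes.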

Strlen optimisation is one of the things hobbyist (and not only hobbyist) programmers most like to do, but actually it has little to do with real gain - it is mostly just pride.
One may hide behind the words "speed matters" and do maniacal things and believe this produces faster code. Well, to some extent optimisation is the right thing, but not beyond what the hardware can provide. And for strlen the CPU ALU will always be faster than the other subsystems of the machine, so whether an algo is kilobytes in size and "maniacally" optimised (an illusion), or just a hundred bytes in size but optimised as much as is actually required, the strlen algos will perform the same.

The testbed I use passes a new pointer for each call, randomly aligned. But in cases where speed matters these functions work in a loop and thus use the same memory repeatedly. A call here and there to strlen is not that important speed-wise.

Did you read the big post? Did you read what was written here, on these pages? You pass a "new pointer" which points into a very limited set of very short strings which, even if they were 100 times longer, would still all fit into the cache of any more or less modern CPU. So this method of testing doesn't even get close to real-world testing conditions. The testing proposal written in the big main posts of the thread, Hutch's advice, and Jochen's implementation of the test are the right way to test the algos which fall under point 3), which you may read about in the main post of the thread.

In cases where speed is important, no code will call strlen twice - it will call it once, save the string length, and then manipulate the string as a counted (binary) string, because the term "zero-terminated string" is not a speed-wise thing and doesn't fit the "fast code" ideology.

And calling strlen from time to time with different strings is what real code will do. No speed-oriented code will call strlen more than once for any given string.

This is just a demagogic dispute ::)

Quote
If you want to test such a function which may be called millions of times in the real world - well, OK, just do a testbed like Hutch said: create an array of pointers to short/long (as you want) strings, but the number of strings should be very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real-world performance of the algo.

I do this all the time in a real application to select which algo to use, and I don’t see the cache as a problem in this case.

If you don't want to understand what is said in the posts of this thread or, more probably, you don't want to agree with it, then I don't know what to say here. If you don't see any problems and think that you are testing the right way - do that, it's your right.

Your testbed is not bad - it's very comfortable to use (I have tried it already) and has stable timings. Maybe you feel that this dispute about the wrong way of testing harms your work - but that's not so. These were points about some types of algos, and that was all more or less clearly stated in the main thread post.

Quote
our tests here and in the Memcopy thread are actually almost useless
...
that was just an example which may prove something to those who find it useful.

I do find the rant useful but somewhat depressing given my experience is not the same.

You may take that as a rant, but probably there are other people who see that it is useful and true - from their experience, too, especially if they don't try to deny some basic things in favour of their pride.

That is your opinion, and you have a right to it. In my opinion, for instance, Intel's 2014 strlen is just marketing stuff and no more. Yes, big and "impressive" code - well, that's how every business has to work: impress the users and they will respect you. But in real conditions all that code will work no faster than an algo which is 10 times smaller. Don't believe it? Well, that's your right. Moreover, a bigger algo which falls under point 3) will work even slower. Why? Because it is not always in the 1st level code/trace cache, and when executed, let's say, one microsecond after the previous call to it, the long and bloated algo takes longer to decode: the shorter algo will already have returned the string length while the bloated but "impressive looking" algo is still being decoded to be fed to the CPU core. Well, this is probably "rant", too, so since anyone has the right to stick with their opinion, the dispute is useless.
Title: Re: Research on the algos design
Post by: Antariy on March 25, 2015, 08:32:19 AM
just do a testbed like Hutch said: create an array of pointers to short/long (as you want) strings, but the number of strings should be very high, so that the total size of all the strings is much bigger than the cache size, and you will see the real-world performance of the algo.

Note that the 'bible' testbed in reply #8 does exactly that. Here is a run with 50 bibles, 200 MB, 1519150 different pointers:
Code: [Select]
Len(): 54501 us for 1519150 lines, 200850450 bytes
len(): 120439 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 53723 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54899 us for 1519150 lines, 200850450 bytes

Len(): 53089 us for 1519150 lines, 200850450 bytes
len(): 111344 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 54654 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 54684 us for 1519150 lines, 200850450 bytes

Thank you, Jochen, that's the right implementation of the test for this kind of algo (from my subjective point of view) :t
Actually, if you did test the "SSE1" code from this thread in that test, then I would recommend not using it for everyday work, because it is unfinished and, more than that, it was just "dirty hacked" into the testbed. It works properly in the testbed - and you may be sure of that, because the first thing the test does is check whether the algos return the right results - but it is not designed to be used in applications. I was too lazy - it's a big, hasty redesign of an almost different algo and is sloppy, but it does its work properly and serves the testing and research purposes described here. Anyone who has doubts about the idea proved in this research may just replace it with Intel's Silvermont algo.
Title: Re: Research on the algos design
Post by: nidud on March 25, 2015, 11:24:30 AM
You are referring to experience - but your experience, even if unlimited, will not disprove some simple, basic things, like the memory bus speed, like the caching subsystem..
Quote
What idea does this research prove.

We all know and like to write the algos and to test them on the lab. The problem is in the method which is used in that. Currently we "test" the algos against short (relatively to the 2nd level data cache, read below for explanation) pieces of data, and the test runs many times on the same short data. Being executed in these conditions, some types of algos do not show the real world performance of their.

After first (at least after couple more) interations of the test - call to the code tested and return from it - the short data is already placed in the cache, so the next iterations of the test are going in an unfair (read below why) conditions for the real world performance of the code. So, just first iteration runs with the real world performance, the next, many thousands of iterations run in not real world but "static" conditions, so in final average results the timing of the first run has no influence to show the real performance.

In the testbed the functions are read into a virtual protected area of the code segment, and they are all called in a validation test before the timing begins. This means they all start with all the data already in the cache.

Quote
OK, well, speaking of referring to your experience: just a couple of weeks ago you were fully sure that there was no possibility to write a safe algo and that the "deal" of speed versus robustness was acceptable.

I don't remember saying it was impossible or acceptable, but I assumed there was no danger in the environment where the functions were used.

Quote
"Experience" is just a word, and it will not disprove basic things like caching/memory bus speed limits/OS paging. We may all refer to our own exclusive and limited (everyone's experience is limited) experience, but that isn't an argument in a dispute; it's just a "side note" like "from your experience something was so".

In this case the "experience" is based on testing with some good results.

Quote
You missed the main point. You talk about functions "called millions of times", but where did you see an application which does this with the same string consecutively?

All the time where speed matters  :P

Sarcasm and the "tongue" smiley are not an argument. You are unable to disprove what was said here and you will never be able to disprove it - take that or not; everyone has the right to be a fanatic.

 :biggrin:

Sorry, didn't see the tongue there, but one tends to use the same memory in these loops.

Quote
Try to find an example of an algo which calls strlen millions of times for the same string - consecutively. Why is the word underlined? Because if you call it only from time to time, the string will be flushed from the cache within milliseconds of the previous call, so the next call to the code will again go the slow - memory bus speed - way.

And there I disagree. When you tokenize a text file you tend to reuse the same memory. The same with reading files where you use the same file block which will end up in the cache before you call strlen to get the length of the file. There could be millions of files down the subdirectories, and you may open all of them and do a byte scan and so on.

Quote
You may take that as a rants,

Maybe a bad choice of a word, but there was a lot of text there  :biggrin:

Quote
Well, but this is probably rant, too, so if anyone has right to stick with its opinion - the disput is useless.

No, that was interesting, and I disagree, dispute is not useless.
Title: Re: Research on the algos design
Post by: rrr314159 on March 25, 2015, 11:50:35 AM
AMD, 2 mb L2 cache

Intel, 1 mb L2, 6 mb L3

(I think)

This testing is a PITA I expect at least a 1500-word post from you after all this trouble :)
(From other thread, Antariy advice:) On AMD use ax with bsf - Thanks! I should have thought of that :redface:

[edit] that smiley :P isn't sticking out his tongue - is he?
Title: Re: Research on the algos design
Post by: jj2007 on March 25, 2015, 06:50:38 PM
When you tokenize a text file you tend to reuse the same memory. The same with reading files where you use the same file block which will end up in the cache before you call strlen to get the length of the file.

When you tokenise a 200MB file, then only a tiny fraction is in the data cache. Afterwards, the pointers may fit into the cache, but the algo needs to examine the 200MB content, not the pointers.

> before you call strlen to get the length of the file.

That sounds as if you want to use strlen for getting the 200MB info. That is hardly a real world usage; you open the file, you know its size. What you don't know yet is the length of many tiny strings (actually, MasmBasic stores the lengths during the tokenising process, but that's another topic).
Title: Re: Research on the algos design
Post by: nidud on March 25, 2015, 09:42:15 PM
When you tokenise a 200MB file, then only a tiny fraction is in the data cache. Afterwards, the pointers may fit into the cache, but the algo needs to examine the 200MB content, not the pointers.

You think JWASM or similar programs use a 200M read buffer? The read process is normally streamed with a small buffer size.
Code: [Select]
_MAXIOBUF equ 4000h
_INTIOBUF equ 1000h
_MINIOBUF equ 0200h

Each line is then normally fetched into a line buffer using fgets(). Both the read buffer and the line buffer are now in the cache.
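That streamed, buffer-reusing pattern can be sketched in C (a minimal illustration of the fgets() loop described above; the buffer size borrows the 0x1000 figure from the constants quoted, which is an assumption about the actual I/O setup):

```c
#include <stdio.h>
#include <string.h>

#define LINEBUF 0x1000   /* small, cache-resident line buffer, in the spirit
                            of the _INTIOBUF-style constants quoted above */

/* Stream a file line by line and sum the line lengths. The same line
 * buffer is reused for every fgets() call, so after the first few lines
 * the memory strlen scans is already hot in the cache. */
size_t count_bytes(const char *path)
{
    char line[LINEBUF];
    size_t total = 0;
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    while (fgets(line, sizeof line, f))
        total += strlen(line);   /* strlen over freshly cached bytes */
    fclose(f);
    return total;
}
```

This is the scenario where cached timings do resemble real usage: the scanned memory really is reused call after call.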

Quote
That sounds as if you want to use strlen for getting the 200MB info.

The buffer in the file block for the file name is 260 bytes. Scanning around 100K files gives this result:
Code: [Select]
Average string lengths: 22, calls: 103780

So what is the average string length in general?
I will assume closer to 20 bytes than to 200 MB.
Title: Re: Research on the algos design
Post by: jj2007 on March 25, 2015, 10:07:30 PM
Hi nidud,

It seems we have a serious communication problem - probably my fault.

What I am trying to say is:
- cache influence doesn't matter whether you read in or stream in 200 MB; in contrast to testing one (cached) string a million times, the algo needs to access memory if the file size is well beyond the 2 or 3 MB of the L3 cache.

- the average string len of asm files is around 20 bytes, sure; for C headers it is around 35 bytes, and for the bible it's 132 bytes; so for testing real world apps, around 50 bytes might be a good compromise.

Another test for non-cached files: sorting a text file. Again, bible.txt is being used, and two algos are being compared:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Masm32 assort:
30382 lines sorted in 0.014 seconds
30382 lines sorted in 0.011 seconds
30382 lines sorted in 0.012 seconds
30382 lines sorted in 0.012 seconds
30382 lines sorted in 0.011 seconds
MasmBasic QSort:
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds
30383 lines sorted in 0.012 seconds

Bible.txt is 4MB, so it can only partially be cached. The attached archive contains a BibleSort50.exe, which creates and tests a 200MB file from bible.txt. And that one definitely doesn't fit into the cache, so be prepared for big differences between CPUs and memory characteristics like bus speed and cache size.
Title: Re: Research on the algos design
Post by: hutch-- on March 25, 2015, 10:13:59 PM
Bible is not a bad test piece at about 4.5 meg. If you need a decent big sample, just knock yourself up a multiple copy version - copy it onto itself until it's a few hundred meg - then you will know what linear algo speed is like. If you have a reasonable tokeniser, tokenise it first, THEN time its read/write/modify speed on a massive array.
Title: Re: Research on the algos design
Post by: jj2007 on March 25, 2015, 10:24:58 PM
if you need a decent big sample, just knock yourself up a multiple copy version, copy it onto itself until it's a few hundred meg

Good idea :t
The attached archive contains a BibleSort50.exe, which creates and tests a 200MB file from bible.txt.
;)
Title: Re: Research on the algos design
Post by: dedndave on March 25, 2015, 10:44:12 PM
god's gonna strike you down for using the bible as "random data"   :badgrin:
Title: Re: Research on the algos design
Post by: jj2007 on March 25, 2015, 10:54:33 PM
god's gonna strike you down for using the bible as "random data"   :badgrin:

I am just providing this important text in a more ordered version 8)
Title: Re: Research on the algos design
Post by: nidud on March 25, 2015, 11:44:05 PM
What I am trying to say is:
- cache influence doesn't matter if you read in or stream in 200 MB; in contrasting to testing one (cached) string a Million times, the algo needs to access memory if the file size is well beyond the 2 or 3 MB of the L3 cache.

My take is that it’s better to play along with the cache than to remove it from the test. My disagreement with Alex is that I think most data in a "real world" application is also cached. The IO-streams often used take this into account by using small buffers and so on.

However, none of this matters for the test, given all algos are fed with either a) cached data or b) uncached data. The result should be the same in either case.

It’s also difficult to know how the cache works with a 200M buffer. If the cache size is 2M (1%) and the file has 100 lines, you may argue that 10 lines will be uncached and 90 of them cached.

Quote
- the average string len of asm files is around 20 bytes, sure,

Maybe, but in this case it was the length of the file name (or maybe the path) and not a line.

Quote
for C headers it is around 35 bytes, and for the bible it's 132 bytes; so for testing real world apps, around 50 bytes might be a good compromise.

I’m not sure the length of a line is a good measure. Each token has a length and a pointer:
Code: [Select]
p2 = p + strlen(tokenarray[q].string_ptr);

Each of the words in the line ["mov", "eax", ",", "1"] will be exposed to strlen, strcmp, strcpy, … , multiple times in the parsing process. I think the average length of a string in a token parser in general - html, C, asm, and so on - is less than 20 bytes.
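For illustration, a hedged C sketch of that usage pattern (the `tok` struct and the function name are invented, loosely following the `tokenarray[q].string_ptr` line quoted above):

```c
#include <string.h>

/* Hypothetical token record, modelled on tokenarray[q].string_ptr. */
struct tok {
    const char *string_ptr;
};

/* Every short token gets measured again at each use site, so strlen
   runs many times per source line on strings well under 20 bytes. */
static size_t line_token_bytes(const struct tok *tokenarray, size_t ntok)
{
    size_t total = 0;
    for (size_t q = 0; q < ntok; q++)
        total += strlen(tokenarray[q].string_ptr);  /* p2 = p + strlen(...) */
    return total;
}
```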
Title: Re: Research on the algos design
Post by: Antariy on March 26, 2015, 04:42:46 AM
rrr,

Antariy,

Thanks very much for your comprehensive answer but I hate to make you type so much! No need to spell everything out in such detail. I see you have strong feelings about "protected programming" and Lingo so will leave those topics alone! As for "topic paster", it was just a joke, no worries.

Well, my feelings about those topics are just IMO. "Protected programming" is not a panacea, since the program after all runs on a non-perfect OS, which in its turn has its own bugs and problems. As for Lingo - I have no feelings about that person at all; that is, in my opinion, what he deserved in the end: the "name" not of a "great coder" but of an immature and uncultured person with an exaggerated opinion of himself.
As for "topic paster" - I just cleared that up so you didn't think it was meant as a slighting word.


The important thing is the code. I understand your reasoning (of course) but it offends my aesthetic sensibility to see all those bytes wasted! So, considering your suggestions, I made some modifications, and tested the shorter algo against yours as well as I could. Bottom line, it seems the shorter algo is, for the most part, faster.

Well, those bytes may be "wasted" on some machines and "useful" on others - the only difference is where the code runs. This is similar to other code which was actively discussed here - the hex2dword algo. My initial code there was sloppy, but it implemented an idea of unified processing of the hex string - no special-casing of the letters - in one path, without checking and correcting the letters to some "hardcoded" case (upper or lower). It worked well, but in statical tests (just like the ones we used with StrLen here) it was slow, and I was drawn into a discussion which took some days and many tries to make the algo "faster" on modern CPUs under the "statical" test. That was hard, because none of the "optimizations" I made produced a faster algo on my own CPU, so I was optimizing just "by touch" and analysis. In the end it became "the fastest" algo on every (!) CPU it was tested on (one cycle faster than the previous "winner" on Core2+ CPUs :greensml: And ~2 times faster on my CPU - but it was faster than the previous "winner" even before the optimisations, so that part was done just for the "sake of the challenge" ::)) - but it cost a lot of time and tries for, actually, no reason, because dumb unrolled code will always be "faster". After finding that unrolled algos will for sure be faster even in real world conditions, I chose to use one instead of a short looped algo.
But why am I telling you this? Because the situation now is very similar. My algo which "won" among the looped algos in that hex2dword challenge also used a lot of those "wasted bytes" to do its task. It was the fastest only in its "bloated" form: when you make the instructions smaller and, instead of pre-checking the starting condition before the loop, just jump into the loop, the algo was no longer the "winner" - it was slower by a couple of cycles, and that mattered there because the whole discussion was run in a "who is the fastest?" manner. So the code was bigger than it needed to be by probably 30%, just because of the "padding long instructions".
So, if I want to get the most from some piece of code, especially on the short-cuts like the code entry and the prechecks before the inner loop, I don't hesitate to use longer instructions: this may save some cycles when the code takes a short-cut, and the timings are small in any case with any other algo, so those saved cycles may be useful to "show" better timings ::) As you may notice, the timings have become an "idol" in these tests. People look at them, respect them, compare and align their code to them; aligning code to the timings is probably not the best thing, because it ultimately leads to patterned code design. The interesting point is: my hex2dword looped code was not pattern-like in its idea, and though the idea was initially implemented "slowly", after the "contest" it was the fastest - still the same code with the same idea - and, moreover, it works on any 32-bit CPU starting from the i386, while the previous "winner" was P6+ only (due to CMOV). The better idea, which avoids conditional processing (jumps or movements) entirely, simply beat the patterned idea both in speed and in the range of CPUs it can run on. But how much time did it take to crazily optimize the code, making it longer and more bloated (still, we are talking about less than 80 bytes), to win the "contest"? It's useless to do that in an "everyday manner" - only if you are in that mood might you want to do it. That's why I just used the longer instructions to align the loop, without too many further experiments with other instruction lengths - on every CPU it will run differently, a bit faster on one and a bit slower on another, and in the end it will run in real world conditions where all algos go "head to head", even the crazily complex and sophisticated ones.
I just wrote code with a good idea in it - you may notice that my 16-byte alignment of the code entry is even smaller and simpler than Intel's in the Silvermont code. That was the first step of the algo, and it was simple and elegant; the next step was simple processing in the aligned inner loop, with aligned data. The loop itself may, in some opinions, be aligned not very elegantly with all those bytes, but the data alignment is elegant, so the code is elegant in itself. I agree it may not look "aesthetic", but look at it as simple code which does its job well. Yes, it could be modified and tweaked in a bunch of ways to run better on some specific CPU, but, as I said, it was written to perform "well" on every CPU - not to perform "the best" on some family or even model. And it's simple, maybe a bit "sloppy", and could be made smaller/faster, but that was not the aim, for the reasons written in the "Research" thread.


First, to avoid register stall with these instructions: mov edx, [esp+4] / mov eax, edx: I just moved the mov down 2 instructions.

I didn't know xorps could save a byte over pxor, thank you. I used those 2 bytes to put the jump back in. It was necessary for the AMD, which is horribly slow on bsf. I jump into the loop, skipping the "add edx, 16", so the logic remains valid.

Still preserving XMM and ECX.

Yes, I know the purpose of the db instructions is to pad 9 extra bytes to align the beginning of the loop. I know that's better than nops or lea eax,[eax], which waste time as well as space. But surely it's even better not to waste the bytes at all, as long as the routine is still as fast or faster.

CPU branch prediction - you know, behavior seems to change with every new release from Intel or AMD. Often we optimize a routine on our own machines, but on newer/older machines it may behave differently. I routinely optimize on my Intel i5, then try it on an AMD A6 and a Pentium 4; often the fastest on one machine is the slowest on another. So I'm leery of artificial coding techniques for speed.

Now, I had thought you were right: the pointer advance should go AFTER the data fetch. However, on the Intel my loop was faster; on the AMD, a little slower. Apparently the order of the two instructions makes little difference. Anyway, there's a very simple correction available if needed: just "pcmpeqb xmm7, [edx+10h]" first, then "add edx, 16" - it uses one more byte.

By far the hardest part of the whole exercise is not writing the routine but getting semi-reliable timing! First, I used your suggestion and put all algos in separate segments. Then I doubled both of them: put yours first and last, mine 2nd and 3rd. It appears the "last" position is slightly favorable. Then, in desperation, I copied/pasted the timing runs a number of times, using just the final numbers.

Well, that will probably avoid the stall, yes - depending on how many execution units the CPU core has and how far the independence of the following instructions goes. I.e., moving the mov down (the pun), you moved it closer to the next use of the reg you moved into; so, removing the stall above, you provoke it below - since the entry code is very compact and simply has no room to spread the instructions out.

Also, you may use movaps/movups instead of movdqa/movdqu - likewise 1 byte shorter. Just a note, because the code already uses movups.

Yes, I agree that if the routine is smaller and consistently faster than the previous version on every CPU, then it's better to use the shortest version. As I said above, I just wasn't playing with that much.

CPU branches were always statically "predicted" as "not taken" for a forward jump and as "taken" for a backward jump, and this logic will probably not change in any CPU, because it matches the simple logic of loops: you do the work, then you check the condition, then you jump back to the start of the loop. This is more efficient in terms of speed; it's less efficient in code size, since you first need to check the condition before entering the loop, but that was probably accepted as the price of faster loop exit and faster looping. The CPU "sees" that the (especially conditional) jump goes backward, so it "knows" this is a loop, and that the jump is almost always more likely taken than not - loops execute more than once, maybe thousands or millions of times, so mispredicting the loop jump on every iteration would cost very many cycles. So the CPU treats the backward conditional jump as an "almost unconditional jump", and only one misprediction occurs when the loop ends - one small stall after thousands of correctly predicted jumps (the pun again). A taken backward conditional jump thus costs almost nothing; only when it finally falls through (the last iteration) does it cost extra clocks as a "misprediction". The same holds for forward jumps: the CPU "assumes" they are not taken; if a forward jump is actually taken it costs some clocks as a misprediction, while a not-taken forward Jcc costs almost nothing.
And you may notice this technique is widely used, for instance in the heavily unrolled Intel Silvermont algo, where the code has many forward conditional jumps. Those jumps cost almost nothing; the one of them that is eventually taken costs some cycles, but after that jump the code finishes its work, so it doesn't matter for the total. After executing several of those almost-free not-taken jumps, the code has already "prepaid" the stall of the single taken forward jump, and since it finishes (or nearly finishes) right after that "misprediction", there is nothing left to care about.
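The loop shape described above can be sketched in C: a compiler typically rotates a while-loop into exactly this form - one pre-check, then a body with the conditional branch at the bottom, where the backward jump is predicted taken on every pass but the last. (The function is just an illustration, a byte counter equivalent to strlen.)

```c
/* Pre-checked, bottom-tested loop: the do/while body compiles to a
   backward conditional jump, statically predicted "taken", so it
   mispredicts only once, when the loop finally exits. */
static unsigned count_to_zero(const unsigned char *p)
{
    unsigned n = 0;
    if (*p == 0)          /* pre-check: don't enter the loop at all */
        return 0;
    do {                  /* body first ...                         */
        n++;
        p++;
    } while (*p != 0);    /* ... condition tested at the bottom     */
    return n;
}
```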

I think pointer advancement before or after the fetch is very disputable, depending on the particular CPU family and even model: on some there is no (noticeable) slowdown when you put the advance before the fetch, on others the slowdown is very noticeable. So it's probably a matter of choice.

Yes, it's hard to get reliable timings; that's why we use the testbed in the lab with millions of repeats to get more or less stable timings.


Your routine was a bit better on Intel ALIGNED 271, and also slightly better on the longer strings on AMD. Everywhere else the shorter routine is better. It's dramatically better on AMD short strings, who knows why, and better on Intel long strings. BTW, all these numbers came out pretty much like this on multiple tests; I'm only showing one typical run from each machine.

Here is my modified routine:

Code: [Select]
; «««««««««««««««««««««««««««««««««««««««««««««««««««
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo - number 2
    mov eax,[esp+4]
    add esp,-14h
    movups [esp],xmm7
    mov edx, eax
    mov [esp+16],ecx
    and edx,-10h
    xorps xmm7,xmm7
    mov ecx,eax
    and ecx,0fh
    jz intoloop
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    shr eax,cl
    bsf eax,eax
    jnz @ret
    xorps xmm7,xmm7
    @@:                       ; naturally aligned to 16
        add edx,16
    intoloop:
        pcmpeqb xmm7,[edx]
        pmovmskb eax,xmm7
        test eax,eax
        jz @B
    bsf eax,eax
    sub edx,[esp+4+16+4]
    add eax, edx
   
    @ret:
    movups xmm7,[esp]
    mov ecx,[esp+16]
    add esp,14h
    ret 4
algo2 endp
end_algo2:

Bottom line, I can't believe it's right to waste those 18 bytes.

Well, as I said, if you can make it smaller and you see that it's reliably faster, then that is the proper thing to do; I did not argue with that, and I'm not trying to prove that all those bytes MUST be there :biggrin: You may also avoid saving XMM and ECX, which will save another 3-5 cycles - especially important with short strings, the case where an algo's speed is the speed of its entry and exit, not of the inner loop. Here is another tweak of the algo - I did not test it, though. In the real world it should perform pretty well. There are some rearrangements in the data fetch and the entry code - since the XMM and ECX saving is avoided, there is no need to reuse the same XMM reg.

Code: [Select]
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo and then again modified by Antariy :) The chain of modifications :)
    mov edx,[esp+4]
    mov ecx,[esp+4]
    and edx,-10h
    xorps xmm7,xmm7
    sub ecx,edx
    ;movaps xmm6,[edx]
    db 0Fh,28h,72h,0
    pcmpeqb xmm6,xmm7
    pmovmskb eax,xmm6
    shr eax,cl
    test eax,eax        ; shr by cl=0 leaves flags unchanged, so test the mask explicitly
    jnz @ret
    @@:
        movaps xmm6,[edx+10h]   ; fetch one block ahead, so edx ends on the block holding the zero
        pcmpeqb xmm6,xmm7
        add edx,16
        pmovmskb eax,xmm6
        test eax,eax
        jz @B

    sub edx,[esp+4]
    bsf ax,ax
    add eax,edx
    ret 4
   
    @ret:
    bsf ax,ax
    ret 4
algo2 endp


It will probably be around 68 bytes, and the loop should still be aligned at 16, with the padding hidden in a "longer than required" instruction :biggrin:
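For comparison, here is how the same idea might look in C with SSE2 intrinsics - a hedged sketch, not a drop-in replacement for the asm above (GCC/Clang assumed for `__builtin_ctz`, which plays the role of bsf): align the pointer down to 16, do one aligned head compare with the pre-string bytes shifted out of the mask, then scan 16 bytes per iteration.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* SSE2 strlen sketch. The aligned head load may read a few bytes
   before s, but never crosses a page boundary, which is what makes
   the align-down trick safe in the asm versions too. */
static size_t sse2_strlen(const char *s)
{
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned shift = (unsigned)((uintptr_t)s & 15);
    __m128i zero = _mm_setzero_si128();

    /* Head: compare the whole aligned block, then shift the bits
       belonging to bytes before s out of the mask. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(*(const __m128i *)p, zero)) >> shift;
    if (mask)
        return (size_t)__builtin_ctz(mask);   /* bsf */

    /* Main loop: one aligned 16-byte block per iteration. */
    for (;;) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(*(const __m128i *)p, zero));
        if (mask)
            return (size_t)(p - s) + __builtin_ctz(mask);
    }
}
```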

Finally, of course I can write Russian! I'll do it again: "Russian". - very easy. OTOH I can't write in Cyrillic to save my life :)

:biggrin:
Title: Re: Research on the algos design
Post by: Antariy on March 26, 2015, 04:45:20 AM
There's a problem with this whole exercise: same code runs very different on different processors.

Quote from: nidud
but this seems to faster for some reason

I do the same thing: try different combinations and select the fastest on my machine, often not knowing why it's faster. Problem is, what happens on another machine?

Well, it runs differently, but not totally differently. And, talking about real world conditions, you may be sure that code which gets close to the bottleneck - the memory bus bandwidth in this case - will run well on every machine.

What points do we have? We have an algo which accesses every byte of the string sequentially, and does so just once per call. So, as was written in the "Research" thread, you may assume the algo will work at the maximal speed the memory bus provides. No matter how complicated the code is - once it has hit the bottleneck, it will sit there in real world usage. Some numbers: DDR1 memory, for example, has a "wire width" of 64 bits, 8 bytes. With an effective data rate of 400 MHz this gives 400,000,000 * 8 = 3,200,000,000 bytes per second (that's why the DDR400 standard is also referred to as PC3200) - a bit more than 3 GB per second. At the same time the CPU accesses its internal memory - the cache - at more than 20 GB per second. Even on old CPUs like the PIV we have seen ~30 GB/s data rates for algos tested in the lab (for instance the line count algo, tested with a dataset fitting into the cache). 30 GB/s - unbelievable, yes? Compare it with 3 GB/s. So you may actually think of 9/10 of the algo's work as waiting for the memory bus to deliver data to the ALU.
Another point: the DDR "wire" width is 8 bytes, but an XMM move takes 16 bytes, so one (!) XMM move takes at least two transactions on the memory bus. It's even more complicated when you notice that the CPU doesn't fill the cache with just the 16 bytes asked for - it fills an entire 32/64 byte cache line. That's why code which accesses sequential data is more performant than code which accesses data in a random fashion: on a miss, the code waits not just for the data required but for an entire cache line fill from memory, i.e. 32/64 bytes (at least) are transferred from RAM into the cache line, starting at the address with the lowest 5/6 bits zeroed. So code which accesses "a byte here and a byte there" - not sequentially - causes with each access the transfer of an entire cache line, even if it touched only one byte.
So, from this point of view, the only "most effective" access to memory is sequential access in 32/64 byte chunks (this is not a constant - it depends on the CPU generation and may differ). All other access patterns are "less effective", but just in theory, because in the end the code is stuck with 16-byte XMM regs - the maximum a 32-bit CPU can handle at a time. If we had 64-byte regs, code using them would have the most efficient correlation between 64-byte aligned memory transfers and ALU operations.
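The arithmetic above, spelled out as a tiny C sketch (the numbers are the ones from the post: an 8-byte DDR400 bus at an effective 400 MHz, and a full cache line dragged in per miss; the function names are mine):

```c
#include <stdint.h>

/* Peak bus bandwidth: effective MHz times bus width in bytes.
   DDR400 with an 8-byte bus gives 3.2 GB/s - hence "PC3200". */
static uint64_t bus_bytes_per_sec(uint64_t mhz_effective, unsigned bus_bytes)
{
    return mhz_effective * 1000000ull * bus_bytes;
}

/* Bytes actually moved over the bus when `accesses` scattered
   single-byte reads each miss the cache: every miss fills a whole
   line, so the traffic is line_size times the useful payload. */
static uint64_t random_access_traffic(uint64_t accesses, unsigned line_size)
{
    return accesses * line_size;
}
```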

The other point, which you maybe saw and wondered about: why can pcmpeqb xmm0,[ebx] be slower (!) than movaps xmm0,[ebx] / pcmpeqb xmm0,xmm7 (with xmm7 zeroed)? That is because of ineffective transactions between the 16-byte regs and the 8-byte data bus: for some reason the direct comparison does this transaction not very well, but the move operation is better optimized in hardware. It's the same kind of thing as: mov eax,[somevalue] / push eax being faster than just push [somevalue]. That may look strange, yes, but it's a hardware implementation specific.


We do the optimisations, but in the end we are behind the hardware implementations, and that is the real reason the algos perform so differently. Thinking about it, it's probably better to write code which performs equally well on every machine than code insanely optimized for a single type of machine. We are not talking about industrial programming, where you fit your code to the hardware, but about code which should work on a "common machine".
Thinking so (pun), you write "code which is enough to do its task", sophisticated just enough to hit the hardware limitations. To this extent, ANY more or less well written SSE code will run much faster on data sets which fit into the cache. So now you have all the info to design your code, depending on what it will do, what amount of data it will process - fitting into the cache or not - and in which manner: repeating "number crunching" loops over the entire data again and again before producing a result, or looping through the data just once (like StrLen does) without any heavy numerical processing.
So now you are sure that a GPR-written strlen is not suitable for you: it's too ineffective in memory throughput, as well as in ALU operations on the smaller data size. You know you need code which processes data in wider chunks and interacts with memory more effectively, and you need it to run on the vast majority of machines all over the world, so you choose SSE2. Then you think about how the code will process the data: will it make many passes over the data to produce its result? If not, you may forget about the "cache" and all its help - there is no effective interaction with memory, no cache-like speed, just the maximum the system can provide: the RAM speed. And SSE code will be faster than the RAM speed, so even sloppy SSE code will beat (if it's not toooooo sloppy) the best possible GPR code. So there is no need to write an algo a mile long and bloated - it will do little beyond impressing the public with its size.



I put together a new "AxStrLenSSE" which beats the Silvermont on my i5. It was easy. First, notice that Antariy's original algo saves XMM and ECX, because of the environment it was designed for; Silvermont doesn't. So I took that out. Then it was still slower on the long strings, so I unrolled the loop 9 times, and it easily "won" on nidud's test bed. But then I tried it on my AMD, and it came in 3rd (SSE 32 also ahead)! (AMD is very bad for bsf, pcmpeqb and pmovmskb.)

...

Anyway, here's this new version of Antariy's algo, which is better on i5's than Silvermont, but worse on AMD ... I know there are various ways to improve it, but I think I'll do something else instead!

Yes, for sure you will optimize it as well as possible for your machine.

As for AMD - try to use as many different XMM regs as possible and do pmovmskb into as many different GPRs as possible. AMD tends to perform better with code which is "truly independent"; probably its register renaming mechanism is weaker than Intel's, though when you write unrolled and really independent (in terms of register names) code, it will probably perform even faster than on Intel. But this is just a note; I do not claim this will work for every existing AMD CPU.

Working at Intel, Torbjorn Granlund has access to hundreds of machines (probably) and can try his Silvermont algo on all of them. If you don't have similar facilities there's no point in trying to keep up with him. I also think it's ridiculous to use 700 bytes for a strLen algo. This is one reason Microsoft and other modern code is so bloated.

Just the same, Torbjorn might have some good ideas. There is only one thing he does that I've never seen before: he uses pminub to take care of the situation you (Antariy and nidud) are discussing right now - that is, integrating the results from 2 or more XMMs reading the string. Perhaps you should take a look at his technique.


Well, the place where he works doesn't mean he produces the best possible code. Again, the tests which prove it the fastest possible are exceptionally synthetical. In the real world there is no need for almost a kilobyte of code which will "do well" only if the string it is given WAS ALREADY entirely in the cache at the time you called it. As part of an application running under a multitasking OS, this code can forget the word "cache", and on a real machine in a real app it will for sure execute no faster than other SSE strlens which are less than 100 bytes long. Moreover, not only does the data cache get flooded with other data, but the code cache, which is smaller, gets flooded many thousands of times per second; executed outside an exclusive environment where a single thread of a single app owns all the CPU core time, the longer code will also stumble over decoding stalls each time it gets called. Marketing stuff like that is nothing more than marketing stuff - just a synthetical test like ours on the lab, with millions of calls to the same code with the same short data.
From whatever point of view you look at it, you will find that code like that is nothing really useful in a real world app. Probably that's something relating to Freud :greensml: Big guns, big cars, big ships, big code :greensml: People like that - the bigger, the better, they think - just the psychology of mankind.

Here we have a wide set of machines - from old ones to the newest. Here I agree with nidud: on the forum we have a pretty wide set of machines to test on and to collect optimisation info from.

The Silvermont code is pretty logical and simple - just a heavy unrolled check until the buffer gets aligned to 64, then a 64-byte aligned loop, with a bunch of possible return conditions handled at different return points. Nothing interesting or new there. But pminub as a method of collecting the lowest bytes while searching for zero is interesting, yes - though I'm not sure it is fast on every machine. After all, comparison with zero is a simple operation which may have shortcuts in the circuit, while actually processing the values of the data and then rearranging bytes in the regs may be not so fast. So I'm not sure that, even with strings located in the cache, a 64-byte fetching loop will perform (much) faster than the loop I used in the big algo from the "Research". The idea is interesting and needs to be tested. But then, the big algo in that thread is just a prototype and not actually finished - I was too lazy for that, and would have had to completely rewrite another algo with a very different design. Being paid for it, people tend to write miles of code ("the bigger, the better"), but here we write just for fun, so I just dirty-hacked the algo into the testbed. It works properly there, though its alignment fixup needs to be rewritten completely; it shows the advantage of a 64-byte fetching inner loop with more or less long strings (still fitting into the cache), and it shows the disadvantage of such a big and complex algo when run in real world conditions - the code is huge, and the time to develop such code is high (maybe that is the reason? the bigger the code, the longer you design and develop it - the time goes by and the salary is paid), but the real gain is nothing more than the hardware can provide to a much simpler algo anyway.
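For reference, the pminub trick under discussion can be sketched with intrinsics (hedged: this is just the merging step, my reconstruction rather than Granlund's actual routine). A zero byte in either 16-byte block survives the unsigned byte minimum, so two blocks can be tested for a terminator with a single pcmpeqb/pmovmskb pair:

```c
#include <emmintrin.h>  /* SSE2 */

/* Test 32 bytes for a zero byte with one compare: pminub (here
   _mm_min_epu8) merges two blocks first - the minimum of a zero byte
   and anything is still zero, so no terminator can be lost. Unaligned
   loads are used here for simplicity; a real strlen would align. */
static int has_zero_byte_32(const unsigned char *p)
{
    __m128i a = _mm_loadu_si128((const __m128i *)p);
    __m128i b = _mm_loadu_si128((const __m128i *)(p + 16));
    __m128i merged = _mm_min_epu8(a, b);                       /* pminub  */
    __m128i eq = _mm_cmpeq_epi8(merged, _mm_setzero_si128());  /* pcmpeqb */
    return _mm_movemask_epi8(eq) != 0;                         /* pmovmskb */
}
```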
Title: Re: Research on the algos design
Post by: Antariy on March 26, 2015, 04:50:32 AM
nidud,

You are referring to experience - but your experience, even unlimited, will not disprove some simple, basic things, like the memory bus speed, like the caching subsystem.
Quote
What idea does this research prove.

We all know and like to write the algos and to test them on the lab. The problem is in the method which is used in that. Currently we "test" the algos against short (relatively to the 2nd level data cache, read below for explanation) pieces of data, and the test runs many times on the same short data. Being executed in these conditions, some types of algos do not show the real world performance of their.

After first (at least after couple more) interations of the test - call to the code tested and return from it - the short data is already placed in the cache, so the next iterations of the test are going in an unfair (read below why) conditions for the real world performance of the code. So, just first iteration runs with the real world performance, the next, many thousands of iterations run in not real world but "static" conditions, so in final average results the timing of the first run has no influence to show the real performance.

In the testbed the functions are read into a virtual protected area of the code segment, and they are all called in a validation test before the timing begins. This means they all start with all the data already in the cache.

Again, you missed the point, or just entangled what was said there with what you took from it. Maybe that's the language barrier (especially as we are both non-native English speakers, I guess).

It is not said there that some code has an advantage over other code, nor that some code runs with the data already in the cache. You probably think I was trying to disprove that your 32-byte fetching algo is faster than a 16-byte fetching one? That was not the point. I don't care if someone writes faster code, nor even about Intel's strlen, because I just know that in real world conditions it is an exaggerated (for no reason) bunch of bytes whose value is exaggerated. The "brand name" "Intel" in front of the algo doesn't mean I should respect the algo just because of the brand. After all, Intel has been known to play unfair games with AMD with their "optimising" (quasi-optimising - favouring Intel's own CPUs; refer to Agner Fog's research on that) compiler runtime. So the software they produce doesn't carry the same "mark of quality" as their hardware (someone might dispute that as well - there are haters of AMD as well as of Intel - but that's beside the point here). They produce hardware in the first place, and so they must make users think the hardware is great. How do they do that? They write "impressive looking" code which takes the best from the latest hardware development but hides much of the detail behind the scenes. Everywhere you will hear that Intel's Silvermont StrLen is the fastest possible - just the cries of the fans - but none of them say that this code is a bloated piece of software which in the real world will execute the same as (if not slower than) a much shorter and simpler algo (yours, for instance). This is "popular culture" inside IT, like in any other field.
So the research makes no attempt to disprove the "speed" (the term is relative) of any particular algo, or of algos in general. You may be sure I see that your algo is "faster" (within the "static world"; likewise Silvermont is faster than your algo in that same static world, but in the real world your algo performs the same as Silvermont, and my simple algo performs the same as Silvermont, too), and I agree that it's faster under certain ideal conditions. I just proved with the research that those ideal conditions will not occur in the real world. There are so many discussions about these algos on the forum that this is simply an attempt to make some notes on the matter, with no intention to offend anyone's desire to optimise code.

In the quote above you agree that "they all start with all the data already in the cache"? Those are your own words, so you agree with them? But that is exactly what this research proves, and that is the problem: with the data in the cache, the timings are fooled, and the CPU plays unfair games with the tester of the code. The core-to-cache interconnection is not the same as the cache-to-RAM interconnection. You see the algo performing under ideal conditions which will never happen to it (for an algo falling under point 3) of the second main post of the thread, as was underlined many times) when it executes on a real-world machine under a multitasking OS, in user mode on top of that.

Only if your code exclusively owned the CPU core, with no other code running before or after it, would the cache contain exactly the data you used last, up to the size of the cache. But the cache is effectively a "circular" buffer: it is limited, and new data accesses flush the older data out. Remembering that the cache has a line-based (a kind of "page") architecture, an access to even a single byte of memory which is not currently cached causes a refill of 32/64 bytes from that memory location, starting at the address equal to address_of_the_byte_accessed AND 0FFFFFFC0h (for 64-byte lines).

Under a multitasking OS, which runs tens of threads simultaneously, those threads access memory which is certainly not from your process's location in RAM: every process has its own memory mapping into RAM, so, effectively, the cache size available to your process is, roughly speaking, the cache size divided by the number of processes. Every process has its own, non-overlapping, physical addresses for its data in RAM, so the cache effectively splits its capacity trying to cache all the different addresses touched by a bunch of threads of a bunch of processes.
So, in the real world, data is actually flushed from the cache very quickly after it is accessed, because there is no reason to keep it there: nobody accesses the same string constantly and continuously, which is what it would take to keep it cached for long. But our testbeds do exactly that. They are therefore utterly different from the conditions under which the code runs on a real-world machine, which, as already explained, evicts data soon after it is accessed. So, unless your application runs a separate thread whose only task is to constantly touch the bytes of the string to keep it in the cache, you may be sure that in almost all cases the string your strlen code receives is not in the cache on a real-world computer. "Almost all" means practically all: the only exception is when you compute the string length and then, with no big delay, do something else with the string, for example scan it. But then, if you are going to scan the string anyway, why call strlen at all? You can do the length/zero-terminator check while scanning the string itself, as the strchr function does. There is really no scenario you can imagine that guarantees your string is in the cache and stays there all the time; the probability of that is astronomically small.

So, in the real world, every algo receives a pointer to a string which is not in the cache, and while scanning it the code has to deal with memory bus speed. Worse, the string data may not even be in RAM but in the paging file, which slows things down to an incredible extent.

That is one of the most important reasons why, when you run the test only once over the tested piece, the timings are crazily high. And that's why we run the test millions of times: to remove the influence of paging and memory bus speed. But those conditions are ideal, so the code is tested synthetically, and code which "on paper" is ten times faster than another algo will, in the real world, stumble on the fact that it runs at the same speed as the more modest and shorter code. At the point where every implementation hits the real-world limits, there is no reason to optimise/unroll further.

Being real-world code, it will also stumble on the not-as-fast-as-light decoding of instructions into micro-ops, so the longer code has longer stalls before it even executes. And remember that the code trace cache is much smaller than the 2nd-level data cache, so the decoded micro-ops are flushed from it practically the moment execution leaves that part of the code. Multitasking makes things even harder: with so many threads, your code gets only a small slice of code cache within the time frame of the one OS quantum given to its thread. If some large piece of code crosses that time frame, then on the thread's next turn it is reloaded and decoded again, so situations are possible where very advanced code returns results slower than simple, slow GPR code :lol: Those are very rare situations, mentioned just for fun. The code caching problems matter far less for the tests (almost not at all) than the data caching issues and the synthetic test results.
Title: Re: Research on the algos design
Post by: Antariy on March 26, 2015, 04:52:21 AM


Quote
OK, well, referring to your experience: just a couple of weeks ago you were fully sure that there was no possibility of writing a safe algo and that the "deal" of speed versus robustness was acceptable.

Don't remember saying it was impossible or acceptable, but I assumed there was no danger in the environment where the functions were used.

You said that the only safe algo would be byte-by-byte; how else can you treat those words? The same may be spelled out as "all non-byte-by-byte algos are unsafe".

You "said" the deal was acceptable in another way: you posted timings showing that the unsafe version of the 32-byte-fetching StrLen produces faster results, much faster, and you noted that even though the algo is unsafe, the speed matters. That is a "deal" in other words: you accepted its unsafety in exchange for its speed. Yes, maybe you did not go on to use it in safety-critical software; that was not the point. The point was that your experience, and your opinion about the possibility of implementing safe and simultaneously fast algos, changed just days ago. So referring exclusively to experience is not an argument; it is an additional point, but not the argument.

Quote
"Experience" is just a word, and it will not disprove basic things like caching/memory bus speed limits/OS paging. We may all refer to our own exclusive and limited experience (everyone's experience is limited), but that isn't an argument in a dispute, just a side note along the lines of "in my experience it was so".

In this case the "experience" is based on testing with some good results.

For sure the results will be good. Where did you see me say that your safe 32-byte-fetching algo is not good?

The research only says that it is complete nonsense to develop overly entangled and overly bloated/sophisticated algos, because all of them finally hit the same speed as the others. The 68-byte single-XMM-register StrLen above will perform the same in the real world as the 800+ byte Silvermont: that's the point. Which code is better is a matter of taste and each person's opinion. Likewise, your code is no slower in the real world than Silvermont, but no faster than the single-register version either; this is astonishing but true, and was proved by the research. The differences are just fractions of a percent for each implementation, and actually show not how good the algos are in ALU terms, but how well the memory bus subsystem performs relative to the cache, given the access method the code uses.

Quote
You missed the main point. You talk about "million times called" functions, but where have you seen an application which does this with the same string consecutively?

All the time where speed matters  :P

Sarcasm and the "tongue" smiley are not arguments. You are unable to disprove what was said here, and you never will be; take it or leave it, everyone has the right to be a fanatic.

 :biggrin:

Sorry, didn't see the tongue there, but one tends to use the same memory in these loops.

You mean the same cache? Yes, all the code on the system uses the same cache for code/data, and that's the point. Your code should always assume that the data is not in the cache when it first accesses it, and that it stays cached no longer than some microseconds (to put it plainly: just for one walk through the data bytes). The cache size is limited, so it behaves like a circular buffer: the oldest data gets overwritten by the newest, and if the code doesn't keep touching the same data all the time, it will quickly be flushed out.

So the data is "always" in memory, until the code accesses it and walks through it; after that the data is in the cache, if it fits. If it doesn't fit, only some piece of it is cached, and under a multitasking OS even that small piece is flushed out quickly. The cache is a very volatile storage: if it kept data around for no reason, it would slow the system down terribly. It holds the most frequently accessed and/or most recently accessed data. And that is what was said: it is unbelievable that there is an app doing strlen on the same string all the time just to keep it in the cache. So strlen algos may freely assume the string passed to them is not in the cache: it is in RAM (or even in the paging file), and that is the limiter of the algo's speed. In the real world, probably even a single-register MMX strlen performs the same as the Silvermont strlen. SSE and MMX are designed to do more or less complex arithmetic on the data; the speed-up is there, not in the wider data access as such, because in purely sequential-access algos (with little arithmetic in them) the memory becomes the limiter.
In number-crunching algos, on the other hand, SSE/MMX speed things up by a much greater factor than GPR code can, because there the limiter is not memory but the smaller amount of data the GPR ALU can process at once. Once the data is in the registers/cache, SSE code can do many complex computations on the packed data much faster than GPR code. At the same time, crazily unrolled SSE code like Silvermont's will probably perform no more than 2 to 2.5 times faster than the plain GPR szLen from the MASM32 library. Even though Silvermont "processes" data in 64-byte chunks with 16-byte registers while szLen goes byte by byte, in the real world it is not 16 times faster, as it might appear, but only about 2 times, and maybe sometimes even slower, because it is much bigger than the simple GPR szLen.


Quote
Try to find an example of an algo which calls strlen millions of times for the same string, consecutively. Why is that word underlined? Because if you call it only from time to time, the string will be flushed from the cache within milliseconds of the previous call, and the next call to the code will again go the slow way, at memory bus speed.

And there I disagree. When you tokenize a text file you tend to reuse the same memory. The same with reading files, where you use the same file block, which will end up in the cache before you call strlen to get the length. There could be millions of files down the subdirectories, and you may open all of them and do a byte scan and so on.

When you read a file, it ends up in RAM (at best, or in the paging file), not in the cache: DMA exists precisely to free the CPU from spending its resources on peripheral transfers.

After you have read the file, you pass it to some function which processes the data. But being in the virtual memory of the process, the data may not even be in RAM but in the paging file, so the first access after the data was flushed from RAM to the paging file is very slow. OK, after some thousands of cycles, the page with the data (only one page, not even the whole file's data) is in RAM. Now the CPU has access to it, at the speed the RAM bus provides. The code computes the string length, and the string is in the cache after that. But what will you do with that string next? Remember, you have only microseconds to continue working with the string if you want to keep it cached; otherwise it will be flushed from cache memory.

So you have the string length, and for some short time you have the string itself in the cache, but that is unimportant and indeed meaningless, because next you will open another file, process it, and so on, and every time you will deal with RAM speed (at best), not internal CPU speed. The string you accessed previously will long since have been flushed from the cache by the time you read the next string from the file, so the next time you touch that string somewhere in the program, you will again pay RAM speed. The cache buys you nothing here.

So, to run tests that are close to the real world, you should be able to "turn the cache off", and the simplest way to do that is to process data much bigger than the cache size, so that even with multiple loops over the data you get timings more or less close to the real ones. "More or less" because even then you will get better results than in the real world: after the first runs over the data, the system will have committed it to RAM (i.e. it will certainly not be in the paging file), so the final results will be closer to the RAM speed limit than to real-world values (which are slower still).

A refinement of the testing suggestion described in the main post of the thread.
We can lock the data in memory, so the code accessing it deals only with RAM speed, not paging-file speed. But there are limits on how much you can lock. Before locking, one should call GetProcessWorkingSetSize, add the size of the buffer to be locked to the values it returns, call SetProcessWorkingSetSize with the computed values, and only then call VirtualLock on the buffer. If it succeeds, the data/buffer is locked in RAM, and that part of RAM is unusable by other processes; that is why this feature is limited for safety reasons (if some processes locked too much memory, the system might run low on memory and crash, or become as unusable as if it had very little RAM). So, practically, we probably cannot assume we will be able to lock 300...500 MB of data in RAM, which means our tests will have to live with paging slowdowns too. Testing, say, 300 MB of data on a system which can provide that amount of memory (i.e. 256 and even 512 MB systems are not suitable for reliable timings), with 10 loops over the code, gives results more or less close to RAM speed (as you remember, the test in the research achieved 67% of RAM speed). And this is more or less close to real-world testing for the algos which fall under point 3) of the research's conclusion.

Quote
You may take that as a rants,

Maybe a bad choice of a word, but there was a lot of text there  :biggrin:

A descriptive explanation is not the same as a "rant". I was not insisting that anyone read it; it was written to be useful to people, and people may read it or not, that is their right. But having not read it, it is a bit strange to argue with it and accuse the writer of ranting.

Quote
Well, but this is probably rant, too, so if anyone has right to stick with its opinion - the disput is useless.

No, that was interesting, and I disagree, dispute is not useless.

Not useless, but not very simple either, especially for a non-native English writer.

Title: Re: Research on the algos design
Post by: Antariy on March 26, 2015, 04:54:03 AM
AMD, 2 mb L2 cache

Intel, 1 mb L2, 6 mb L3

(I think)

This testing is a PITA. I expect at least a 1500-word post from you after all this trouble :P

(From the other thread, Antariy's advice:) On AMD use ax with bsf :redface: I should have thought of that

No, the testing is good. Where do you see the PITA? And who is he? :biggrin: BTW, what does this word mean?

From your results there's an obvious change in the timings at the 1 to 2 MB crossing on the AMD, and likewise between 100 KB and 1 MB on the Intel, so the test works well.
Title: Re: Research on the algos design
Post by: Antariy on March 26, 2015, 04:55:55 AM
And now all of you are simply obliged to read all of this :greensml:
Well, this is good training in English writing, but I see that the more tired you get, the worse your English is ::)
Title: Re: Research on the algos design
Post by: dedndave on March 26, 2015, 05:48:46 AM
PITA = pain in the ass   :P
Title: Re: Research on the algos design
Post by: Antariy on March 26, 2015, 06:58:36 AM
PITA = pain in the ass   :P

Thank you for explanation, Dave :P :t
Title: Re: Research on the algos design
Post by: nidud on March 26, 2015, 07:57:05 AM
You said that the only safe algo would be byte-by-byte; how else can you treat those words?

That was the assumption at the time, yes.
Quote
The only safe way will then be a byte scan.

Quote
The same may be spelled out as "all non-byte-by-byte algos are unsafe".

Quote
In the MEM functions it will be possible to avoid this issue given you have the actual size of the buffer, but in the STR functions you need to scan.

The Intel versions do a byte alignment at the top so that was the assumed cure, but the Agner version was better.

Quote
You "said" the deal was acceptable in another way: you posted timings showing that the unsafe version of the 32-byte-fetching StrLen produces faster results, much faster. And you noted that even though the algo is unsafe, the speed matters; a "deal" in other words: you accepted its unsafety in exchange for its speed.

It was more to analyse the situation or "damage" if you will.
Quote
Reply #35 (http://masm32.com/board/index.php?topic=4067.msg43127#msg43127)
Quote
Finally, something you should know: if your routines are called to copy <=15 bytes, when PAGE_NOACCESS protected memory happens to be just beyond the source buffer, they'll blow up :P
Reply #38
Quote
well, good catch  :t

Reply #76 (http://masm32.com/board/index.php?topic=3396.msg43375#msg43375)
Quote
Some notes about the SIMD functions in this thread.

As pointed out by rrr in this post, reading ahead more than one byte may read into a protected memory page and blow up.

Quote
Yes, maybe you did not go on to use it in safety-critical software; that was not the point. The point was that your experience changed just days ago -

Opinions may change, but experience, I think, belongs to the past.

Quote
Quote
"Experience" is just a word, and it will not disprove basic things like caching/memory bus speed limits/OS paging. We may all refer to our own exclusive and limited experience (everyone's experience is limited), but that isn't an argument in a dispute, just a side note along the lines of "in my experience it was so".

In this case the "experience" is based on testing with some good results.

In the beginning, while testing SIMD functions with Frank (a long time ago), I gave up testing because the results were all over the place. Later I managed to get some stable test results and then rewrote the string functions. Testing applications linked to the new library was the real "experience" in this case.

Quote
For sure the results will be good. Where did you see me say that your safe 32-byte-fetching algo is not good?

Have I accused you of saying that?

Quote
The research only says that it is complete nonsense to develop overly entangled and overly bloated/sophisticated algos, because all of them finally hit the same speed as the others.

If you increase the byte size to a magic number, that seems to be the case. The memcpy test suggests the number is between 1000 and 2000 bytes before the "smart-ass technology" kicks in.
Title: Re: Research on the algos design
Post by: hutch-- on March 26, 2015, 09:25:25 AM
Alex,

One comment on your English practice, use more paragraphs as it breaks up big blocks of text and makes it a lot easier to read.

RE: hardware differences accompanied with big smiles and wise nodding, welcome to the world of massive variation ranging from an Intel Atom to i7 to AMD and all of the old timers before them. Some give up in despair, others try for the magic bullet but most eventually settle for an algo that works on most things OK, has good averages and have an eye on the future hardware instructions.  :biggrin:
Title: Re: Research on the algos design
Post by: Antariy on March 27, 2015, 05:35:48 AM
Alex,

One comment on your English practice, use more paragraphs as it breaks up big blocks of text and makes it a lot easier to read.

RE: hardware differences accompanied with big smiles and wise nodding, welcome to the world of massive variation ranging from an Intel Atom to i7 to AMD and all of the old timers before them. Some give up in despair, others try for the magic bullet but most eventually settle for an algo that works on most things OK, has good averages and have an eye on the future hardware instructions.  :biggrin:

Thank you, Hutch :biggrin:
Title: Re: Research on the algos design
Post by: Antariy on March 27, 2015, 06:28:27 AM
Quote
The research only says that it is complete nonsense to develop overly entangled and overly bloated/sophisticated algos, because all of them finally hit the same speed as the others.

If you increase the byte size to a magic number, that seems to be the case. The memcpy test suggests the number is between 1000 and 2000 bytes before the "smart-ass technology" kicks in.

If you carefully read those big posts, you will see that these "arguments" are not arguments at all. Independent of any "magic numbers", there will be no speedup in real-world execution, where you do NOT execute the algo thousands of times over the same data, constantly. If you do not want to read those big posts, then you should at least agree to disagree with them, but not try to "disprove" them without trying to understand them; otherwise the discussion is just demagoguery.
Title: Re: Research on the algos design
Post by: nidud on March 27, 2015, 09:19:55 AM
If you increase the byte size to a magic number, that seems to be the case. The memcpy test suggests the number is between 1000 and 2000 bytes before the "smart-ass technology" kicks in.

This has to do with hardware, where new technology does things differently. You may read more about the magic number here (http://masm32.com/board/index.php?topic=3396.msg36384#msg36384). In this test you will see how the SIMD functions are eaten by a simple REP MOVSB at a specific (magic) byte size.

Quote
If you carefully read those big posts

:P

Quote
Well, with string sizes smaller than 16, the "fastest" algo is the first one, using one XMM register. If the string is 16+ (with the terminator, 17+), it will not fit into one XMM register, so it requires additional execution and iteration, and after that the algo becomes "slower".

The second, complex algo has a higher overhead from its complex entry than the third (nidud's) algo, which is why it's a little slower on short strings.

This I agree with, Alex. It's logic, not magic. And here comes the interesting part:
Quote
The second algo shows the "cost" of the complex entry + huge data-fetch code, which cannot "pay its cost back" on too-short strings.

So here is (in my humble opinion) where you have to do some research:
1. What type of application needs a fast string library?
2. How is the code/data organized in this application?
3. What is the average byte size passed to this function?

Quote
As you see, the algos go almost "head to head", starting from 1 MB of data up to 64 MB.

Depending on the answer to point 3 this may be irrelevant for the test, but still useful information.

Quote
1) In the real world, the data passed to the algos is almost always not in the cache, especially data which isn't touched very frequently (many times per second).

This I disagree with, as mentioned above, but it depends on the answer to point 2.

Title: Re: Research on the algos design
Post by: Antariy on March 29, 2015, 04:59:28 AM
If you increase the byte size to a magic number that's seems to be the case. The memcpy test suggest the number is between 1000 to 2000 bytes before the "smart-ass technology" kicks in.

This has to do with hardware where new technology do things differently. You may read more about the magic number here (http://masm32.com/board/index.php?topic=3396.msg36384#msg36384). In this test you will see how the SIMD functions are eaten by a simple REP MOVSB at a spesific (magic) byte-size.

I know about rep movsd's extra speed in specific circumstances, but that comparison is irrelevant to this research; here we talk about different things, one "level deeper" than the movsd/movdqa comparison. Actually, you may think of this thread as explaining why movsd is faster once the byte count is high enough: if you read the thread, you saw the stated point that in real-world conditions the data used by algos falling under point 3) of the conclusion is always not in the cache. So "shortcuts" in the CPU circuitry which do a non-temporal direct move without polluting the cache are the reason movsd is faster at higher byte counts. movsd/stosd with high byte counts works bypassing the cache, so you may notice it works in exactly the conditions this research forced the algos into, where they show the drawbacks of over-complicated design.

Quote
If you carefully read those big postes

:P

Carefully read them, yes; and I forgot to add: try to understand them before trying to disprove them.

Quote
Well, with string sizes smaller than 16, the "fastest" algo is the first one, using one XMM register. If the string is 16+ (with the terminator, 17+), it will not fit into one XMM register, so it requires additional execution and iteration, and after that the algo becomes "slower".

The second, complex algo has a higher overhead from its complex entry than the third (nidud's) algo, which is why it's a little slower on short strings.

This I agree with, Alex. It's logic, not magic. And here comes the interesting part:
Quote
The second algo shows the "cost" of the complex entry + huge data-fetch code, which cannot "pay its cost back" on too-short strings.

So here is (in my humble opinion) where you have to do some research:
1. What type of application needs a fast string library?
2. How is the code/data organized in this application?
3. What is the average byte size passed to this function?

I do not have to do another research just because you did not want to agree with this one. This research said what it set out to say, and it is useful for those who want to get it rather than disprove it out of a lack of understanding. This discussion is something like Russell's teapot: you want to claim there's a teapot in orbit between Earth and Mars, but you want me to prove or disprove that with further research. A strange thing. You may take this as "you're right and I just do not want to do the research because I'm afraid it will contradict what I said before"; i.e., if you prefer to think you're right here, think so, but I'm too lazy to do research requested by people who will not listen to the results anyway if they are not what was expected.

If you're not lazy yourself: find a decent real-world app and substitute the strlen algo it uses, in 3 steps with 3 different algos: the single-XMM-register one, your 32-byte-fetching one, and Intel's 64-byte-fetching one. Each time benchmark the application, and then try to find even a 5% difference in its performance.

Those in this thread who agreed with or learned something from this research, agreed with it; those who do not agree, do not agree. (But you should note that those who agree are greater in number, and they did not try to disprove anything, because there is nothing here to dispute or disprove.)

Quote
As you see, the algos go almost "head to head", starting from 1 MB of data up to 64 MB.

Depending on the answer to point 3 this may be irrelevant for the test, but still useful information.

Quote
1) In the real world, the data passed to the algos is almost always not in the cache, especially data which isn't touched very frequently (many times per second).

This I disagree with, as mentioned above, but it depends on the answer to point 2.

Again, if you read the thread carefully, you will notice there is a very descriptive and simple explanation of why you may assume that algos falling under point 3) work with data that is not in the cache when it is passed to them in a call. If you won't read it, or still disagree, then I do not have to search for "proof" of why there's no teapot flying around the sun. Demagogic discussions are always built on long explanations from one side and "trolling" from the other, which disagrees but makes short, unargued statements; and there is no reason to keep explaining things that have been explained many times in the thread. If you don't agree with the 3 points in the conclusion of the research, there is nothing to add to the explanations already repeated many times. No real-world testing will disprove any claim made in this research, and the synthetic kind doesn't count. Those who disagree should either keep disagreeing or try to disprove a claim with arguments; otherwise the author of the research does not have to continue a discussion about a very simply and descriptively explained research which cannot be disproved in principle. Those who disagree have the right to disagree silently; otherwise such careless attention to the explanations looks more like trolling.
Title: Re: Research on the algos design
Post by: nidud on March 29, 2015, 06:58:20 AM
if you read this thread, you saw the stated point claiming that, in real-world conditions, the data used by algos which fall under point 3) of the conclusion of this research is always not in the cache.
Quote
3) The algos which do not do more than one iteration over all the data passed to them, taking into account what is said in the two points above,
Quote
1) In the real world, the data passed to the algos is almost always not in the cache, especially data which isn't touched very frequently (many times per second).
2) The data may not even be in RAM but in the paging file, which makes access to the data even slower.

This I have already answer, so it’s difficult to understand what you really mean here. Given there is no physical connection between the CPU and memory all data will be loaded into the cache. Using large buffers in hope of flushing it may not work out as you think.

Since you can't bypass the cache, my argument was to play along with it. If you are paranoid about the same string being in the cache on repeated calls, you may try to flip pointers:
Code: [Select]
A db "string A",0
db 32/64/128 dup(0)
B db "string B",0

However, if you look into your own test you will see that even the test macro touches memory during each iteration, so my guess is that the mantra of "million calls with the same…" may not be that much of a problem.

Nevertheless, I was not going to make any arguments about how the cache works; my argument was against your claim that the tests done in this forum use "a million calls with the same…", which is simply not true.

Quote
So, "shortcuts" in the CPU circuitry which do non-temporal direct moves without messing up the cache are the reason why movsd is faster with higher byte counts. The movsd/stosd with high byte counts works bypassing the cache, so you may notice that it works in the very conditions this research forced the algos into, so that they show the drawbacks of overcomplicated design.

When the conditions which "the research forced the algos into" exceed the magic number, you only measure the technology impact of the inner loop.

This is, as I was saying, interesting, but I'm not sure how relevant it is to testing in general. The research is fine, but I disagree with some of your conclusions, especially the part about tests done here in this forum being void as a result of it.
Title: Re: Research on the algos design
Post by: Antariy on March 30, 2015, 08:19:03 AM
If you did read this thread, you saw the point which claims that, in real world conditions, the data used by the algos which fall under point 3) of the conclusion of this research is always not in the cache.
Quote
3) The algos which do not do more than one iteration over all the data passed to them, taking into account what is said in the two points above,
Quote
1) In the real world, the data passed to the algos is almost always not in the cache, especially data which is not touched very frequently (many times per second).
2) The data may not even be in RAM but in the paging file, which makes access to the data even slower.

This I have already answered, so it's difficult to understand what you really mean here. Given that there is no direct connection between the CPU and memory, all data will be loaded into the cache. Using large buffers in the hope of flushing it may not work out as you think.

You gave an answer, yes, but it is still technically incorrect. You do not understand how the CPU-cache and cache-memory subsystem works and look at all of it as a "black magic box". "All data" will never be loaded into the cache, otherwise there would be no reason to use RAM if the cache were big enough to contain the entire "all data" of all programs (including the OS itself). The cache is there to speed up access to very frequently accessed data, but not more than that. If, after the entire thread and its discussion, you still don't see where your assumptions and beliefs are wrong, how can that be proved to you? If you prefer to decide that the tests used on the forum are the right thing, and if you prefer to optimize your code based on those timings - well, it is your right to do your optimization and testing as you want. But remember: you will never be able to disprove the things said in this research - nobody will - and this is not a proud speech; if you don't believe it, disprove it, or find someone who feels able to disprove it, but you will search very long and finally not find such a person. So it's better to spend that time optimizing your code in the ways you think are right. After all, optimizing your code based on the testbeds used on the forum will give you a vague but more or less right way to optimize code, and the results you get will be good. The point of this research was that you may achieve the same good final results without too-fanatic optimization attempts.


Since you can't bypass the cache, my argument was to play along with it. If you are paranoid about the same string being in the cache on repeated calls, you may try to flip pointers:
Code: [Select]
A db "string A",0
db 32/64/128 dup(0)
B db "string B",0

However, if you look into your own test you will see that even the test macro touches memory during each iteration, so my guess is that the mantra of "million calls with the same…" may not be that much of a problem.

Nevertheless, I was not going to make any arguments about how the cache works; my argument was against your claim that the tests done in this forum use "a million calls with the same…", which is simply not true.

It is not that I am paranoid - it is that you are stubborn and thus act ignorantly. With the example of swapping two pointers to very short strings, which will obviously still fit into the cache, executed - well, OK, not a million, but half a million times over each string - you will run the test in conditions which will never happen to the code in the real world.

I don't understand what you meant about the "test macro". Still, if you look at the tests not too carelessly, you will see that the second test, which showed the similarity of the timings for every algo tested, does only 10 iterations over the same string, with a size bigger than the cache. And you may decrease that number even further. Still, you continue to use citations and phrases without understanding what was said there - but do you really think this is a valid argument with which to claim that the statements made in the research are not true? That's ignorance.

Well, you were not going to make any arguments (doesn't this sound funny in a discussion, which is based only on arguments? If not on arguments, then on what basis will you construct your dispute?). OK, you make no arguments, but you dismiss the claim of the research just because you "don't agree". Don't you see that this is the behaviour of fanatical belief in something? You will not make arguments, but you will believe in what you think, and that's all for you.

If you don't understand what is said here - be it ignorance, inattention, or just lack of understanding - then it is even more ignorant to try to disprove it or make your own claims about the "trueness" of the research and the citation about "millions of calls". If you don't understand what was at the base of the research, and if you don't believe it, then it is your right to do your work as you want. But it looks like it is you who keeps using citations like mantras without understanding them or the underlying things on which they are based, with a paranoid desire to (dis)prove something about points you did not understand. If you think that your example with two "swapped" pointers gets any closer to the real world - well, continue to believe that. But a test with millions of different (even short) strings passed to the algo consecutively is the only test which will show the real performance of your algo - that's what was said in the research, that's what was said by Hutch too, and that's what was implemented by Jochen in code using 200 MB of strings to test strlen, and it showed the claims here to be true. But, again, these are not arguments which you, in your fanatical beliefs, will pay attention to.

Quote
So, "shortcuts" in the CPU circuitry which do non-temporal direct moves without messing up the cache are the reason why movsd is faster with higher byte counts. The movsd/stosd with high byte counts works bypassing the cache, so you may notice that it works in the very conditions this research forced the algos into, so that they show the drawbacks of overcomplicated design.

When the conditions which "the research forced the algos into" exceed the magic number, you only measure the technology impact of the inner loop.

This is, as I was saying, interesting, but I'm not sure how relevant it is to testing in general. The research is fine, but I disagree with some of your conclusions, especially the part about tests done here in this forum being void as a result of it.

Those "magic numbers" mean nothing real with small buffers - there you just measure the speed of the CPU-cache interconnection. The "magic numbers" are real only with larger strings, and they are not something "black boxed" but just shortcuts in the CPU design which come into play only after some byte count specific to the design - and take notice: only after a big byte count. The bigger the byte count, the more performant the MOVSD code is and the bigger the difference between SSE and MOVSD code; but still, using techniques which avoid trashing the cache with moves, you may get the same performance with SSE code. Then again, real world code has no gain from using complex and long SSE code if it gets, after all, the same performance that rep movsd provides. SSE is for number crunching - there it may have the advantage of cache help on small data sets, and even without that help, on big data sets, SSE still gains just from wider ALU arithmetic. But notice that for StrLen code the speed is only 2 times higher than regular non-scasb GPR code in real world conditions. Even made 10 times more overcomplicated and quasi-sophisticated, the code will still be no faster than that in the real world. But in the testbeds used on the forum the code will look even 10 times faster, and the code writer will think that the development time he put into the code was not useless, even though in the real world the code will not be more than 2 times faster than GPR code. This is what you believe in - the idol of the forum testbed's "millions of loops over the same short data". This idol is false; in the real world, in any real app, the code posted a page earlier, being 68 bytes long, will perform at the same speed as Intel's 800+ byte code. You may find this "interesting", but you still continue to believe in your idol of testbed timings.
Well, there you may see "magic promises" from your belief, but in a real application you will see a speedup limited by the things you do not want to know about (instead, you want to "disprove" them).
Title: Re: Research on the algos design
Post by: jj2007 on March 30, 2015, 10:30:18 AM
Given that there is no direct connection between the CPU and memory, all data will be loaded into the cache. Using large buffers in the hope of flushing it may not work out as you think.

Since you can't bypass the cache, my argument was to play along with it. If you are paranoid about the same string being in the cache on repeated calls, you may try to flip pointers:
Code: [Select]
A db "string A",0
db 32/64/128 dup(0)
B db "string B",0

I am afraid that Alex is right here. There is a reason why I am reading multiple copies of bible.txt in the fat files thread (http://masm32.com/board/index.php?topic=4119.0); and the reason is precisely that modern CPUs have several data caches - see Multi-level caches (http://en.wikipedia.org/wiki/CPU_cache#Multi-level_caches). Level 1, the fastest, was 8k in a P4; today, the Haswell architecture (http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29) features three levels:
Quote
L1 cache    64 KB per core
L2 cache    256 KB per core
L3 cache    2 MB to 20 MB shared

It is obvious that even if you swap pointers, your complete data is in L1. If you use a 200MB file instead, you may still have some minor influence from L3 (typically 2MB nowadays, 1% of 200MB), but the influence of L1 and L2 becomes negligible. This is why timings differ dramatically between a) multiple accesses to a small memory area in the cache and b) few accesses to a big memory area that must be almost completely shuffled into the cache before it can be used by the CPU.
Title: Re: Research on the algos design
Post by: nidud on March 30, 2015, 11:57:18 AM
You gave an answer, yes, but it is still technically incorrect. You do not understand how the CPU-cache and cache-memory subsystem works and look at all of it as a "black magic box".

The cache system was developed by the same people who designed the standard IO streams used in most (if not all) software today. The cache has been improved over time at both ends, where a faster file cache and CPU cache speed up the streaming process.

The problem with these streams is usage. They are in all textbooks and samples, and in all standard libraries. They use a small buffer (default is one page) by design and normally fetch bytes in sequence. The file cache then pays the cost of filling the buffer, and the size of the current working set fits very well in the CPU cache.

This isn't necessarily logical compared to allocating a few hundred megs to fit the whole file in memory instead of reading small chunks, but given that both software and hardware development favour this type of behaviour, the streaming continues.

This means that most binaries that swallow text files use this method, and those will be the applications that need a fast string library. This is the reason why most of these functions are written in assembly and, given the variation in cache systems, are often custom-made for each CPU. The aim of the testing will then be, as Hutch said, to find one that works OK in most cases.

Quote
"All data" will never be loaded into the cache, otherwise there would be no reason to use RAM if the cache were big enough to contain the entire "all data" of all programs (including the OS itself).

Yes, all data must be loaded into the cache, as explained above. However, the cache system doesn't know the size of your buffer, so there is a lot of guesswork done there. The same goes for the cache line fed to the CPU, where only the first chunk of data is what was actually requested and the rest is guessed.

Quote
The cache is there to speed up access to very frequently accessed data, but not more than that. If, after the entire thread and its discussion, you still don't see where your assumptions and beliefs are wrong, how can that be proved to you?

My assumptions are explained above, so I disagree with your conclusion regarding the definition of a "real world" application's use of the cache. My take is that applications with heavy use of string functions, as explained above, will most likely (by design) always work on cached data.

Quote
If you prefer to decide that the tests used on the forum are the right thing, and if you prefer to optimize your code based on those timings - well, it is your right to do your optimization and testing as you want. But remember: you will never be able to disprove the things said in this research - nobody will - and this is not a proud speech; if you don't believe it, disprove it, or find someone who feels able to disprove it, but you will search very long and finally not find such a person. So it's better to spend that time optimizing your code in the ways you think are right. After all, optimizing your code based on the testbeds used on the forum will give you a vague but more or less right way to optimize code, and the results you get will be good. The point of this research was that you may achieve the same good final results without too-fanatic optimization attempts.

As I said before I don't disagree with the research but with some of the conclusions you made from the result.

Quote
It is not that I am paranoid - it is that you are stubborn and thus act ignorantly. The example of swapping two pointers to very short strings, which will obviously still fit into the cache

The cache line fed to the CPU contains "guessed" data, and it may be 32/64/128 bytes long. Two small strings will then fit in there, given they are close together. The only way to stop the cache from serving this data is to flush the line, which is done by putting some garbage between the strings so they land in different lines.

Quote
and being executed - well, OK, not a million, but half a million times over each string - you will run the test in conditions which will never happen to the code in the real world.

Well, if you search a million files for the word "million", the same string is touched millions of times in the process. However, the point is that the data set you work with has a fixed address and (small) size and is thereby more predictable and stable.

Quote
I don't understand what you meant about the "test macro",

The cache has many layers, but the one closest to the CPU is the one holding the data currently being worked on. It is loaded when memory is touched, and if the touch has a different address than the previous one, the current cache line is evicted. If the same (or a spatially close) address is loaded, there may be a copy of that chunk in the current line which will be reused.

Quote
still, if you look at the tests not too carelessly, you will see that the second test, which showed the similarity of the timings for every algo tested, does only 10 iterations over the same string, with a size bigger than the cache.

As said before, it will all be loaded into the cache regardless of how big it is.

Quote
If you think that your example with two "swapped" pointers gets any closer to the real world - well, continue to believe that.

If you have any other method of disabling the cache, I'm all ears.  :P

Quote
Those "magic numbers" mean nothing real with small buffers - there you just measure the speed of the CPU-cache interconnection. The "magic numbers" are real only with larger strings, and they are not something "black boxed" but just shortcuts in the CPU design which come into play only after some byte count specific to the design - and take notice: only after a big byte count. The bigger the byte count, the more performant the MOVSD code is and the bigger the difference between SSE and MOVSD code; but still, using techniques which avoid trashing the cache with moves, you may get the same performance with SSE code.

I think you missed the point. It's MOVSB which is the fastest in this case. If you ran the test on your CPU (or mine) the result would be the opposite, so this has to do with new technology. If you are going to use real math to calculate the behaviour of the cache, you need to use a specific CPU. The behaviour differs between models and brands.
Title: Re: Research on the algos design
Post by: nidud on March 31, 2015, 05:56:36 AM
There is a reason why I am reading multiple copies of bible.txt

  :biggrin:

Quote
modern CPUs have several data caches

So what?

The transfer speed between memory and the CPU is interesting, but I made a point of the fact that this has no bearing on test results with regard to testing algos.

Slow transfer speed: no problem - same result.
Fast transfer speed: no problem - same result.

Given that the testbed is testing software and not hardware, the only problem would be an unstable transfer speed.

Quote
It is obvious that even if you swap pointers, your complete data is in L1.

When data is transferred from memory to cache, a chunk of data corresponding to the chipset is copied. This is a fast way of copying "a section of the chip", if you will, or a "cache line".

The size of this chunk is, I think, related to the bus; on most modern CPUs the line is 64 bytes.

Code: [Select]
movdqa  xmm0,[eax] ; load 16 bytes - the read fetches a full cache line from memory
movdqa  xmm1,[eax+16] ; load 16 bytes - served from the same, just-filled cache line

Quote
If you use a 200MB file instead, you may still have some minor influence from L3 (typically 2MB nowadays, 1% of 200MB), but the influence of L1 and L2 becomes negligible. This is why timings differ dramatically between a) multiple accesses to a small memory area in the cache and b) few accesses to a big memory area that must be almost completely shuffled into the cache before it can be used by the CPU.

True, using the cache is faster. However, in one call (assuming a 200M long string) none of the fetched lines will be reused, but they are nevertheless pushed into the cache.
Title: Re: Research on the algos design
Post by: hutch-- on March 31, 2015, 10:44:02 AM
I wonder at some of the comments in this enduring debate. The vast majority of relevant information is contained in data, and code is only a factor if it is big enough to break the code cache, a situation best avoided for maximum performance. Now, depending on the age and type of hardware, there have been different strategies for loading data from main storage into the processing component of the hardware, and we know that in any of the more recent processors you have a sequence of caches that increase in speed and decrease in size as they get closer to the data processing component of the hardware.

Now, if it's data that is specific to the algorithm (a lookup table or similar), you need to keep its size down and have it accessed so that it remains as close as possible to the processing component of the hardware. The rest is the data that you are going to process: it shifts from data storage through a succession of data caches that become progressively closer to the processing component of the hardware. This was done historically to improve the throughput of data rather than having to keep accessing the data in main storage or as a single block of allocated memory.

Now apply that to testing algorithms and you have some guidelines as to what you need to do to create a test for different algorithms applied to the same task. At its simplest, if you need to test combinations of mnemonics without regard to the data being processed, then short loops in level 1 cache do the job OK, as long as you understand the limitations of this testing technique.

At the other end of the scale, if you have to process very large data in a linear manner, you need to construct a test that emulates the conditions you will use the algorithm in. A gigabyte-sized data file does the job here, as it emulates the cache load and, in some cases, the write-back of data to either the same memory or an allocated block at another address. Another consideration here is the option in later SSE instructions to make non-temporal writes to avoid two-way cache pollution. There are contexts where cached reads and non-temporal writes give big performance gains if executed properly.

Between these two ends is a massive range of different tasks that differ in terms of data access and iteration rate. A reasonable amount of code is called at an incidental level rather than hammered at high frequency, and the test here is attack, not necessarily linear read time. Designing a test for this type of condition is a lot harder than it looks, because while you may be able to design an algorithm that is a superstar in the test bed, unless you match the algorithm to the task being performed it runs the risk of getting its ass kicked by another algorithm that better matches the task.
Title: Re: Research on the algos design
Post by: jj2007 on March 31, 2015, 09:50:56 PM
True, using the cache is faster. However, in one call (assuming a 200M long string) none of the fetched lines will be reused, but they are nevertheless pushed into the cache.

And this is the exact reason why processing already cached data (e.g. two little strings) is different from processing 200MB that must be pushed into the cache before it becomes available to the CPU.
Title: Re: Research on the algos design
Post by: nidud on April 01, 2015, 02:04:49 AM
And this is the exact reason why processing already cached data (e.g. two little strings) is different from processing 200MB that must be pushed into the cache before it becomes available to the CPU.

According to your philosophy of "two little strings", they could each be two million bytes long and still fit in a 4M cache, but we are speaking of different things here.

The pushing in this case happens after you finish processing the data you have already copied from memory. It shows that if you don't play along with the system, you will be punished.

Quote
If you use instead a 200MB file, you may still have some minor influence of L3 (typically 2MB nowadays, 1% of 200MB), but the influence of L1 and L2 becomes negligible.

I think you'll find it is the other way around. All chunks from the 200M buffer will be copied and processed in the fast lane. It's only the waste product that will populate (and overwrite) the rest of the cache.

When you stream the data, the same thing happens. Data is copied from the buffer (memory) and processed in the fast lane, and occupies 4096 bytes of L1. The difference is that when the stream refills the buffer, the content of that memory region changes, and the corresponding region in L1 is invalidated rather than populated into the rest of the cache.

Another consideration here is the option in later SSE instructions to make non-temporal writes to avoid two-way cache pollution. There are contexts where cached reads and non-temporal writes give big performance gains if executed properly.

If the working set of data in the application fits well in the cache, there may be a performance gain in simply not overwriting it. There are some library functions that do this, but I have no idea how they work. Maybe they do a block copy and trash the source as they go along.
Title: Re: Research on the algos design
Post by: jj2007 on April 01, 2015, 05:52:49 AM
The pushing in this case happens after you finish processing the data you have already copied from memory.

Oh really?

include \masm32\MasmBasic\MasmBasic.inc      ; download (http://masm32.com/board/index.php?topic=94.0)
.data
S1      db "Hello, this is a simple string intended for testing string algos. It has 132 characters without zero, as the text file loaded below.", 0
S2      db "Hello, this is a simple string intended for testing string algos. It has 132 characters without zero, as the text file loaded below.", 0
  Init
  Recall "Bible200.txt", L$()
  mov ebx, eax      ; #of strings
  PrintCpu 0
  Print "Little exercise showing the influence of the cache:"
  REPEAT 3
      xor ecx, ecx      ;string counter
      xor edi, edi      ; len counter
      NanoTimer()
      .Repeat
            add edi, Len(L$(ecx))
            inc ecx
      .Until ecx>=ebx
      Print Str$("\n%i ms to get the length of ", NanoTimer(ms)), Str$("%i strings", ecx), Str$(" with %i total bytes", edi)
     
      xor ecx, ecx      ;string counter
      xor edi, edi      ; len counter
      NanoTimer()
      .Repeat
            add edi, Len(offset S1)
            inc ecx
      .Until ecx>=ebx
      Print Str$("\n%i ms ", NanoTimer(ms)), Str$("to get %i times the length of 1 string", ecx), Str$(" with %i total bytes", edi)
     
      xor ecx, ecx      ;string counter
      xor edi, edi      ; len counter
      NanoTimer()
      .Repeat
            add edi, Len(offset S1)
            inc ecx
            add edi, Len(offset S2)
            inc ecx
      .Until ecx>=ebx
      Print Str$("\n%i ms ", NanoTimer(ms)), Str$("to get %i times the length of 2 strings", ecx/2), Str$(" with %i total bytes\n", edi)
  ENDM
  Inkey "And now??"
  Exit
end start


Output:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Little exercise showing the influence of the cache:
234 ms to get the length of 6076600 strings with 803401800 total bytes
67 ms to get 6076600 times the length of 1 string with 802111200 total bytes
69 ms to get 3038300 times the length of 2 strings with 802111200 total bytes

236 ms to get the length of 6076600 strings with 803401800 total bytes
67 ms to get 6076600 times the length of 1 string with 802111200 total bytes
68 ms to get 3038300 times the length of 2 strings with 802111200 total bytes

237 ms to get the length of 6076600 strings with 803401800 total bytes
66 ms to get 6076600 times the length of 1 string with 802111200 total bytes
68 ms to get 3038300 times the length of 2 strings with 802111200 total bytes
Title: Re: Research on the algos design
Post by: nidud on April 01, 2015, 07:17:10 AM
The pushing in this case happens after you finish processing the data you already have copied from memory.

Oh really?

 :biggrin:

Yes, really.

In addition to the slow copy from memory to cache, you are also hit by a massive reshuffle of the cache in the same process. You're being punished.
Title: Re: Research on the algos design
Post by: jj2007 on April 01, 2015, 08:48:40 AM
Believe what you want to believe. Why am I wasting my time here? ::)
Title: Re: Research on the algos design
Post by: rrr314159 on April 01, 2015, 09:41:58 AM
Why am I wasting my time here? ::)

... when I could be wasting it somewhere else? :biggrin:
Title: Re: Research on the algos design
Post by: nidud on April 01, 2015, 09:51:32 AM
How about believe in your own test results?

An 8M buffer will overflow a cache of 4M in a loop. The last cache line in the loop will flush out the last line of the first part.
Code: [Select]
.data?
long db 800000h dup(?)

Using the same lengths as before
Code: [Select]
.data
length_table label dword
dd 1,5,9,16,19,22,28,36,41,35,27,21,18,15,8,4,2
; dd 42,45,48,49,56,59,62,68,76,81,75,67,61,58,55,50,47,44
; dd 600,601,605,609,716,719,822,828,836,999,905,827,821,718,715,608,604,602
dd 0

Make strings from table
Code: [Select]
lea edi,long
mov ecx,size_d
mov eax,'x'
rep stosb
mov byte ptr [edi-1],0
lea edi,long
l1:
lea esi,length_table
l2:
lodsd
test eax,eax
jz l1
add edi,eax
cmp edi,offset endlong
jnb l3
mov [edi],cl
jmp l2
l3:

Loop through the 8M buffer x times
Code: [Select]
counter_begin 1000, HIGH_PRIORITY_CLASS
mov edi,count
mov ebx,esp
l1:
lea esi,long
l2:
push esi
call ProcAddress
mov esp,ebx
lea esi,[esi+eax+1]
cmp esi,offset endlong
jnb l1
dec edi
jnz l2
counter_end

The result for each of the lengths
Code: [Select]
result: string length 2..40 - no cache
   424283 cycles - (105) proc_1: SSE 32
   433102 cycles - (104) proc_2: AxStrLenSSE
   445574 cycles - (673) proc_8: SSE Intel Silvermont
   474455 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   535386 cycles - (  0) proc_0: crt_strlen
   563018 cycles - (818) proc_9: SSE Intel Atom

result: string length 40..80 - no cache
   493896 cycles - (105) proc_1: SSE 32
   497500 cycles - (673) proc_8: SSE Intel Silvermont
   552300 cycles - (104) proc_2: AxStrLenSSE
   580440 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   740700 cycles - (818) proc_9: SSE Intel Atom
  1002596 cycles - (  0) proc_0: crt_strlen

result: string length 600..1000 - no cache
  3715463 cycles - (105) proc_1: SSE 32
  4055950 cycles - (673) proc_8: SSE Intel Silvermont
  4063407 cycles - (104) proc_2: AxStrLenSSE
  4132161 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  4496775 cycles - (818) proc_9: SSE Intel Atom
  7245457 cycles - (  0) proc_0: crt_strlen

Using the whole buffer
Code: [Select]
result: string length 8M - no cache
  6236043 cycles - (673) proc_8: SSE Intel Silvermont
  6317136 cycles - (818) proc_9: SSE Intel Atom
  7438441 cycles - (105) proc_1: SSE 32
  8087361 cycles - (104) proc_2: AxStrLenSSE
  8191742 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
 14104841 cycles - (  0) proc_0: crt_strlen

Here is the same test using small strings (cached). Note that the difference in real time between these tests is large, this one obviously being much faster. The result will differ depending on how the whole loop takes advantage of the cache, so this will tilt the algos a bit.

Code: [Select]
result: string length 2..40 - using cache
   263056 cycles - (673) proc_8: SSE Intel Silvermont
   282322 cycles - (105) proc_1: SSE 32
   327211 cycles - (104) proc_2: AxStrLenSSE
   330122 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   373559 cycles - (818) proc_9: SSE Intel Atom
   424289 cycles - (  0) proc_0: crt_strlen

result: string length 40..80 - using cache
   330421 cycles - (105) proc_1: SSE 32
   389739 cycles - (673) proc_8: SSE Intel Silvermont
   456273 cycles - (818) proc_9: SSE Intel Atom
   469557 cycles - (104) proc_2: AxStrLenSSE
   482406 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   939124 cycles - (  0) proc_0: crt_strlen

result: string length 600..1000 - using cache
  1327363 cycles - (673) proc_8: SSE Intel Silvermont
  1479271 cycles - (105) proc_1: SSE 32
  1522073 cycles - (818) proc_9: SSE Intel Atom
  2040309 cycles - (104) proc_2: AxStrLenSSE
  2084028 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  5785043 cycles - (  0) proc_0: crt_strlen

The question was which of these two conditions is the more realistic one. I assumed the latter.