Prefetch stalls and queue length tests

Started by Antariy, January 16, 2015, 06:47:37 AM


Antariy

These are tests that check for stalls after code patching and measure the prefetch queue length. Results from different machines are appreciated :t

The code is self-explanatory. If you want to rebuild the programs, you need to pass /SECTION:SEC1,rwe on the linker's command line.

Patch_Stalls_Test.exe checks how long the stalls are in code that gets modified. For simplicity, the modifying code and the stall-checking code are the same code, located in the same page. You can also run the modifying code from another page to see whether patching slows down the patching code itself: to do this, specify 1 as the command-line argument to the EXE. It is also possible to redirect the modification from the page where the code runs to another page in the data section. This shows that patching the executable page itself does not slow the code down unless the running code lies within the prefetch length of the patched place; just compare the results of this run with the run where 1 is specified. To redirect the modification to the data section, specify 4 as the command-line argument.

To simplify things, just run Run.bat (that's a lot of "run" words :biggrin:). The second executable (see below) needs to be run manually from the command line so that its output can be redirected.

The results:

Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 5922186048

Test with the code from the same page but patching the data in the other section

Press a key after number appears
cycles for SEC1: 104518976

Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 105681304

Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 104268832


How to read the results: in the first test, the running code is located in the same page as the patched place and close to it, so the patched place is in the prefetch queue. The patching code is the same code for simplicity (the program was intended to run on old machines as well, so real multithreading is not required), but this makes no difference. You can see that the code runs very slowly in comparison with the second test.

The second test runs the same code, but with the patch redirected to a page in the data section. It runs more than 50 times faster, so the cost of a patch may roughly be called a stall of 50+ cycles for a full or partial refill of the queue (which one depends on the implementation and is not known without the CPU's design documentation).
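As a quick check of the "more than 50 times" figure, the ratio can be computed directly from the two cycle counts posted above. This is only arithmetic on those numbers, sketched in Python rather than MASM; the variable names are illustrative.

```python
# Cycle counts copied from the first two test results posted above.
self_patch_cycles = 5922186048   # test 1: code patches the page it runs in
data_patch_cycles = 104518976    # test 2: same code, patch redirected to the data section

ratio = self_patch_cycles / data_patch_cycles
print(f"self-patch run is about {ratio:.1f}x slower")
```

The same two lines of arithmetic can be applied to the results from any other machine to see whether that CPU shows the slowdown at all.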

The third test patches the executable page, but the patching code is located in another executable page. The results are the same as in the second test, which shows that the EXECUTABILITY of the section's pages is not what triggers the refill: the CPU refills not because of the executable flag, but only when the patch lands in the prefetch queue. So, if the patched place (code or data) is located outside the queue, there will be no stalls when patching. To check how long the queue is, run the second executable from the archive (see a bit below).

The fourth test is just a mix of the second and third tests: code running from a page other than the main test page patches a byte in a page of the data section, which is not marked executable. The speed is the same as in tests 2 and 3, so this is one more example that the executable mark is not the reason for the refill. The reason is the place the CPU is currently running and the queue for that place; the CPU watches only this region, so there are no stalls when patching outside the queue.


So, if the CPU refills the prefetch queue after a patch, this test will show it, since it runs in protected mode with paging, in a Windows environment. To check, it is enough to look at the first two tests: if the numbers are drastically different, the CPU does refill (even if it doesn't in real mode); if the numbers are nearly equal, the CPU doesn't refill.


To check how long the queue is, the second program from the archive, Patch_Stalls_Test_Prefetch_Length.exe, was written.

It runs the same test as the first program, but sequentially increments the byte pointer at which it patches the page. The idea is that once the pointer is beyond the reach of the prefetch queue, the CPU will no longer refill it after every patch, so there will be no slowdown, and this will show up in the results as a drastic difference in the timings of the running code.

The output of the program is huge, so please redirect it to a file, open the file in a text editor, and find where the notable difference in the timings occurs. It is very noticeable: first there is a bunch of similar numbers after "cycles:", and then another bunch of numbers with much smaller values. Please copy the lines around the "leap" and paste them to the forum. You may also save the text file under the name of your CPU, zip it and attach it to your post if you see some interesting patterns in the difference. Maybe some CPUs have some "interleaving" in the queue, for instance; if so, the entire file will be interesting.

An excerpt of the results:

Address: 004023FE, cycles: 035976F0
Address: 004023FF, cycles: 0355BF18
Address: 00402400, cycles: 000F4680
Address: 00402401, cycles: 000FDDC0


The page starts at 402000, so for the Celeron D310 the prefetch queue length is 400h bytes, and the CPU measures it not from the EIP value but from the start of the page. We can see this because patching at 400h and beyond (402400 and above) does not cause stalls. The testing code is in the same page and takes up some of it, so the patched area does not start at offset zero from the start of the page, yet the boundary has a "round" value of 400h. So the CPU prefetches not relative to the offset of the currently running code, but relative to the page start.

Some side notes: knowing the queue length makes it possible to solve the "code placement" problem - with the queue length and the address of the tested code, you can check what is near it in the page.

But my proposal was to use a separate code section for every tested algo in the sources, which solves this problem without any computations - every algo will sit in its own place, and the prefetch queue will be filled with its code.

As for an unconditional jump over data placed in the code - it doesn't help to avoid stalls - the CPU simply refills the prefetch queue whenever the patch lands within the current queue range.
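Finding the "leap" by eye in a file of roughly 4000 lines is tedious, so here is a minimal sketch of doing it with a script. It is written in Python rather than MASM, and it assumes the output format shown in the excerpt (both fields in hex). The function name `find_leap` and the threshold `factor=8` are illustrative choices, not part of the test programs.

```python
import re

# Matches lines such as "Address: 004023FE, cycles: 035976F0" (both fields are hex).
LINE_RE = re.compile(r"Address:\s*([0-9A-Fa-f]{8}),\s*cycles:\s*([0-9A-Fa-f]{8})")

def find_leap(lines, factor=8):
    """Return (address, prev_cycles, cycles) for the first line whose timing is
    at least `factor` times smaller than the previous line's, or None."""
    prev = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip truncated or unrelated lines
        addr, cycles = int(m.group(1), 16), int(m.group(2), 16)
        if prev is not None and cycles * factor <= prev:
            return addr, prev, cycles
        prev = cycles
    return None

# The four lines from the excerpt above.
sample = """\
Address: 004023FE, cycles: 035976F0
Address: 004023FF, cycles: 0355BF18
Address: 00402400, cycles: 000F4680
Address: 00402401, cycles: 000FDDC0
""".splitlines()

PAGE_START = 0x402000  # page start as stated in the post
leap = find_leap(sample)
if leap:
    addr, before, after = leap
    print(f"leap at {addr:08X}, offset {addr - PAGE_START:#x} from page start")
```

To run it on a real dump, pass the open file instead of `sample`, for example `find_leap(open("results.txt"))`, where `results.txt` is whatever name the output was redirected to.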

dedndave

P-4 Prescott w/htt @ 3.00 GHz
Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 6016809188

Test with the code from the same page but patching the data in the other section

Press a key after number appears
cycles for SEC1: 101633866

Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 101628263

Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 101085968

Antariy

Thank you, Dave! :t

How long is the queue of your CPU? The second exe from the archive, Patch_Stalls_Test_Prefetch_Length.exe, reports that; just redirect its output to a file and attach it to your post, if you would like to test that.

dedndave


hutch--

Alex,


Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 116407773
Test with the code from the same page but patching the data in the other section
Press a key after number appears
cycles for SEC1: 114670101
Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 115656843
Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 115897284


Hint: when redirecting output, use STDERR for messages that should be seen on the screen while it is running. StdOut is what gets redirected to the file.
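The hint can be illustrated with a small sketch, again in Python rather than MASM. The function `report` and its messages are hypothetical and are not taken from the test program; the point is only which stream each kind of message goes to.

```python
import sys

def report(results):
    # Progress goes to stderr, so it stays on the console under "prog > out.txt".
    print("running, this may take a few minutes...", file=sys.stderr)
    for addr, cycles in results:
        # Results go to stdout, which is what the redirection captures.
        print(f"Address: {addr:08X}, cycles: {cycles:08X}")
    print("done", file=sys.stderr)

report([(0x4023FF, 0x0355BF18), (0x402400, 0x000F4680)])
```

Run with stdout redirected to a file, only the "Address: ..." lines end up in the file, while the two status messages still appear on the screen.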

dedndave

i ran that program - it runs 50% CPU for a while - then drops to 0 and seems to hang
the resulting text file is 148 kb

Address: 00402041, cycles: 041C75E9
Address: 00402042, cycles: 0421D890
Address: 00402043, cycles: 042B5EDA
Address: 00402044, cycles: 04265AC6
Address: 00402045, cycles: 043B63BE
Address: 00402046, cycles: 04210816
Address: 00402047, cycles: 04221CF9
.
.
Address: 004023FE, cycles: 0428C2DB
Address: 004023FF, cycles: 041E7964
Address: 00402400, cycles: 000F45EE
Address: 00402401, cycles: 000FBC59
.
.
Address: 00402FC9, cycles: 000F450E
Address: 00402FCA, cycles: 000FBB4C
Address: 00402FCB, cycles: 000F43FF
Address: 00402FCC, cycles: 000F44D9
Address: 00402FCD, cycles: 000FBBF8
Address: 00402FCE, cycles: 000F441E
Address: 00402FCF, cycles: 000F44BB
Address: 00402FD0, cycles: 000FEB5F
Address: 00402FD1, cycles: 000F43D2
Address: 00   <------------------------------------ hung

Antariy

Quote from: dedndave on January 16, 2015, 08:13:42 AM
that EXE runs forever - lol

No, it just runs a long time :biggrin: Just wait a couple of minutes (5 mins is the maximum for your CPU, I think), if that's not too boring.
Could you change the TEST_COUNT variable to a smaller value, for instance 10 times smaller than it is, and rebuild the program? Then the test will run faster. Don't forget to pass the /SECTION argument to the linker, as mentioned in the first post.

The point is that the code checks every byte of the page, so there are ~4000 checks and the same number of lines in the output file. It doesn't hang, it just runs slowly.

dedndave


Antariy

Quote from: hutch-- on January 16, 2015, 08:16:08 AM
Alex,


Test with code self-patch the page where it runs
Press a key after number appears
cycles for SEC1: 116407773
Test with the code from the same page but patching the data in the other section
Press a key after number appears
cycles for SEC1: 114670101
Test with call to the external page which patches the other page with the code
Press a key after number appears
cycles for SEC2: 115656843
Test with call to the external page which patches the data from the other page
Press a key after number appears to EXIT
cycles for SEC2: 115897284


Hint: when redirecting output, use STDERR for messages that should be seen on the screen while it is running. StdOut is what gets redirected to the file.

Interestingly - your CPU doesn't stall at all after a patch to a location near the running code. Maybe that's an advance in the CPU logic, which clearly determines that the patched location is not part of the code.

The program doesn't print anything except the results dump, so there was nothing to report on screen, but I should have mentioned that the code runs for a long time and might look hung.

Antariy

[quote author=dedndave link=topic=3960.msg41646#msg41646 date=1421356918]
i ran that program - it runs 50% CPU for a while - then drops to 0 and seems to hang
the resulting text file is 148 kb

Address: 00   <------------------------------------ hung

[/quote]

Oh, it seems that part of the buffered stdout output was not flushed to the file - that may be an MSVCRT problem, or maybe you ended the program by hitting Ctrl+C/Break? After the CPU usage drops to zero, just press any key in the console window, so the program finishes and MSVCRT flushes the buffer to disk properly.
I should have mentioned that in the first post ::) Or maybe I should remove the wait for a key from the program and just make a beep, and maybe also add user input for the loop count to adapt it to older CPUs. Well, the first release was "beta" :greensml: Joking aside, it works, it's just missing some things that would make it comfortable to use.

dedndave

invoke crt__getch           ;<------- it's waiting for a keypress - lol
invoke crt_exit,edi



Antariy

Dave, you've done it - even with the buffer not fully dumped, the results are there:


Address: 004023FE, cycles: 0428C2DB
Address: 004023FF, cycles: 041E7964
Address: 00402400, cycles: 000F45EE
Address: 00402401, cycles: 000FBC59


So with this method your CPU shows the same queue length as mine. On modern CPUs the method will probably not work, because the CPU detects that the changed part isn't code, but it's interesting to see.

dedndave



you're supposed to press a key, Chris !!!!

Antariy

Quote from: dedndave on January 16, 2015, 08:38:55 AM
invoke crt__getch           ;<------- it's waiting for a keypress - lol
invoke crt_exit,edi


That's why I said in the main post "the code is self-explanatory" :biggrin: Sorry for not spelling that out in the post.
So the first time you ended it with Ctrl+C? Then the buffer just wasn't flushed to the file.

So I must note to Steve (FORTRANS): if you run the tests, the second of them may take some time. You may rebuild the app as described in the main post with a smaller loop count so that it runs faster. I just picked values small enough to give stable timings on my CPU, but I don't know which value will be better for your CPU(s).