Count cycles: test reports needed

Started by Gunther, May 24, 2022, 06:30:30 AM

Gunther

Counting cycles isn't an easy thing to do. We have a fundamental problem with the usual macros. As a normal application, we are running in ring 3. This means that there is no direct hardware access for us.
We know little about task switches, micro-operations, cache misses or branch mispredictions. In principle, we could determine these, but doing so requires write access to certain control registers. That's
not possible in ring 3.

What can we do? Under Linux we could use a kernel module for this. This would guarantee exclusive access to the CPU. Under BSD or Windows a driver would have to be used. Only then would
the measured values be reliable and meaningful. This path is tedious and costs a lot of time.

But there's another option. Under plain DOS there are no task switches, and exclusive access to the CPU is guaranteed as well. I wrote a small program that counts the cycles for a short code sequence. The
measurement is repeated 20 times so that cache warm-up effects become visible and can be ruled out. All 20 results are printed at the end. This only serves to show from which iteration the values stabilize. Of course, the median, the arithmetic mean
or the variance could be calculated. But that would only be statistical smoothing without any factual background.

The application is written with PowerBASIC and JWASM. This is the first step. I'm working on a version that's completely written in assembly language. A short sequence of FPU instructions is tested: load
double, 4 floating point divisions, save double. Testing the program can't do any harm. Here is the output under DOSBox 0.74-3:

Sorry!
The usage of the Time Stamp Counter isn't possible
with the available CPU.
Program ends now.


This is correct, because DOSBox only emulates an 80486. The Time Stamp Counter came with the Pentium. Here is the output under FreeDOS running under VirtualBox:

Iteration 1:  104 Cycles
Iteration 2:  106 Cycles
Iteration 3:  104 Cycles
Iteration 4:  106 Cycles
Iteration 5:  104 Cycles
Iteration 6:  108 Cycles
Iteration 7:  100 Cycles
Iteration 8:  102 Cycles
Iteration 9:  98 Cycles
Iteration 10:  108 Cycles
Iteration 11:  104 Cycles
Iteration 12:  106 Cycles
Iteration 13:  100 Cycles
Iteration 14:  108 Cycles
Iteration 15:  104 Cycles
Iteration 16:  108 Cycles
Iteration 17:  102 Cycles
Iteration 18:  102 Cycles
Iteration 19:  100 Cycles
Iteration 20:  106 Cycles

Please, press any key to end the application...


The same machine, application started under plain FreeDOS without any drivers:

Iteration 1:  82 Cycles
Iteration 2:  82 Cycles
Iteration 3:  84 Cycles
Iteration 4:  84 Cycles
Iteration 5:  86 Cycles
Iteration 6:  82 Cycles
Iteration 7:  82 Cycles
Iteration 8:  82 Cycles
Iteration 9:  84 Cycles
Iteration 10:  82 Cycles
Iteration 11:  82 Cycles
Iteration 12:  82 Cycles
Iteration 13:  82 Cycles
Iteration 14:  82 Cycles
Iteration 15:  82 Cycles
Iteration 16:  82 Cycles
Iteration 17:  82 Cycles
Iteration 18:  82 Cycles
Iteration 19:  82 Cycles
Iteration 20:  82 Cycles

Please, press any key to end the application...


Where does the difference come from? Why is the program slower under VirtualBox? Well, as mentioned at the beginning: We are an application in ring 3 and are additionally emulated.
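
For readers who want the two listings above side by side, here is a minimal Python sketch that summarizes them. It is purely descriptive and explains nothing about the cause of the gap; the function name summarize and the choice of the population variance are assumptions made for illustration, not part of the original program.

```python
from statistics import mean, median, pvariance

# Cycle counts copied from the two listings above.
virtualbox = [104, 106, 104, 106, 104, 108, 100, 102, 98, 108,
              104, 106, 100, 108, 104, 108, 102, 102, 100, 106]
plain_freedos = [82, 82, 84, 84, 86, 82, 82, 82, 84, 82,
                 82, 82, 82, 82, 82, 82, 82, 82, 82, 82]


def summarize(cycles):
    """Return min, median, mean and population variance of one run.

    summarize is a hypothetical helper, not taken from the thread.
    """
    return {
        "min": min(cycles),
        "median": median(cycles),
        "mean": mean(cycles),
        "variance": pvariance(cycles),
    }


for name, run in (("VirtualBox", virtualbox), ("plain FreeDOS", plain_freedos)):
    print(name, summarize(run))
```

The minimum is often the most useful single figure for such short sequences, since disturbances can only add cycles, but that is a judgment call and not something the measurements themselves prove.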

Nevertheless, I would be happy about test runs and reports in other environments.
You have to know the facts before you can distort them.

FORTRANS

Hi,

   P-III, Windows 2000 displays:

Command Prompt - cc
The NTVDM CPU has encountered an illegal instruction.
CS:0b33 IP:0044 OP:0f01f966a3 Choose 'Close' to terminate the application.


   Windows XP displays a similar message.  OS/2 displays a fancier
message that says the same sort of thing.  Address 44 in all three.

Regards,

Steve

Gunther

Steve,

Quote from: FORTRANS on May 24, 2022, 07:20:44 AM
   Windows XP displays a similar message.  OS/2 displays a fancier
message that says the same sort of thing.  Address 44 in all three.

yes, I've checked it with XP Mode (Windows Virtual PC) running under Win 7. I see the same effect here. Apparently these emulations don't like the rdtsc instruction.

HSE

Hi Gunther!

In VirtualBox with FD1.3 the mean is 3000 cycles when running from an emulated HDD on a FAT32 USB drive, but 20 cycles if I copy the same file to virtual drive A: (a little difference  :biggrin:)

HSE 

Later:
         It doesn't make much sense; everything depends on the number of cores in the virtual machine.
         With more than 1 core, USB seems to run in some slow thread(?). Values are between 3000 and 6000 cycles.
         With only one core, the original CC result is 88-96 cycles, rdtsc alone is 18-24 cycles, and replacing rdtscp with cpuid-rdtsc gives 2846-4464 cycles.
         Something doesn't work as expected in FreeDOS, VirtualBox or both.
Equations in Assembly: SmplMath

HSE

Hi Fortrans!

Quote from: FORTRANS on May 24, 2022, 07:20:44 AM
   P-III, Windows 2000 displays:

My guess is that your machine can't deal with rdtscp; it must be replaced with the older rdtsc (if I understand correctly  :biggrin:)
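
The opcode bytes in the NTVDM message (OP:0f01f966a3) fit this guess: 0F 01 F9 is the encoding of RDTSCP, while the older RDTSC is 0F 31. A minimal Python sketch of that check follows; the function name and the two-entry table are assumptions made for illustration, and the trailing bytes of the OP field are simply ignored.

```python
# Encodings from the Intel instruction set reference:
# RDTSC = 0F 31, RDTSCP = 0F 01 F9.
KNOWN_OPCODES = {
    bytes.fromhex("0f31"): "RDTSC",
    bytes.fromhex("0f01f9"): "RDTSCP",
}


def identify_faulting_instruction(op_field):
    """Match the leading bytes of an 'OP:' hex string against the table.

    identify_faulting_instruction is a hypothetical helper; it returns
    None when the bytes start with neither encoding.
    """
    raw = bytes.fromhex(op_field)
    for encoding, name in KNOWN_OPCODES.items():
        if raw.startswith(encoding):
            return name
    return None


# Hex string taken from the NTVDM error message quoted above.
print(identify_faulting_instruction("0f01f966a3"))
```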

_japheth

Quote from: Gunther on May 24, 2022, 06:30:30 AM
Where does the difference come from? Why is the program slower under VirtualBox? Well, as mentioned at the beginning: We are an application in ring 3 and are additionally emulated.

Nevertheless, I would be happy about test runs and reports in other environments.

Here's a result running on bare MS-DOS :

Iteration 1:  108 Cycles
Iteration 2:  72 Cycles
Iteration 3:  108 Cycles
Iteration 4:  108 Cycles
Iteration 5:  72 Cycles
Iteration 6:  108 Cycles
Iteration 7:  108 Cycles
Iteration 8:  72 Cycles
Iteration 9:  108 Cycles
Iteration 10:  108 Cycles
Iteration 11:  72 Cycles
Iteration 12:  108 Cycles
Iteration 13:  108 Cycles
Iteration 14:  72 Cycles
Iteration 15:  108 Cycles
Iteration 16:  108 Cycles
Iteration 17:  72 Cycles
Iteration 18:  108 Cycles
Iteration 19:  108 Cycles
Iteration 20:  72 Cycles


Doesn't look too good. Even if running in "ring 0" and with interrupts disabled, measuring time for just a few instructions is perhaps not really feasible - because there's SMM (System Management Mode), which even the OS cannot prevent from executing.
Stupidity paired with audacity - a terrible power.

Gunther

HSE,

Quote from: HSE on May 24, 2022, 08:36:38 AM
My guess is that your machine can't deal with rdtscp; it must be replaced with the older rdtsc (if I understand correctly  :biggrin:)

that may be. I replaced RDTSCP with RDTSC and then the code runs in XP Mode. This means that the entire CPU is emulated - but rather poorly. There are good reasons
why I used RDTSCP in this method. The Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2B states:
Quote
The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read
operation is performed.
In other words, everything that precedes RDTSCP in program order has been executed before the counter is read, which plain RDTSC does not guarantee. A simple substitution with RDTSC
will therefore not work with the method I'm using. The whole approach must then be changed. I'll think about it.

Gunther

Andreas,

Quote from: _japheth on May 24, 2022, 03:26:19 PM
Doesn't look too good. Even if running in "ring 0" and with interrupts disabled, measuring time for just a few instructions is perhaps not really feasible - because there's SMM (System Management Mode), which even the OS cannot prevent from executing.

that's absolutely right. SMM has been around since the 80386SL, but it is probably one of the most underestimated phenomena. Under Linux there is a kernel module that registers system management interrupts. Using it, the
measured cycle counts can be corrected. But that's easier said than done. Maybe there is a chance to do this under plain DOS? I'm thinking about it.

Thank you for your report.

FORTRANS

Hi,

Quote from: HSE on May 24, 2022, 08:36:38 AM
Quote from: FORTRANS on May 24, 2022, 07:20:44 AM
   P-III, Windows 2000 displays:

My guess is that your machine can't deal with rdtscp; it must be replaced with the older rdtsc (if I understand correctly  :biggrin:)

   I booted MS-DOS on this machine, and the program crashed.
So, good call.

   To answer a vague question, I used FTP to put the program
into the VirtualBox on my Windows 8.1 laptop.  And this time
Gunther's program ran in a VDM.  Here are the results.

Iteration 1:  132 Cycles
Iteration 2:  132 Cycles
Iteration 3:  132 Cycles
Iteration 4:  132 Cycles
Iteration 5:  132 Cycles
Iteration 6:  132 Cycles
Iteration 7:  132 Cycles
Iteration 8:  132 Cycles
Iteration 9:  132 Cycles
Iteration 10:  132 Cycles
Iteration 11:  132 Cycles
Iteration 12:  132 Cycles
Iteration 13:  132 Cycles
Iteration 14:  132 Cycles
Iteration 15:  132 Cycles
Iteration 16:  132 Cycles
Iteration 17:  132 Cycles
Iteration 18:  132 Cycles
Iteration 19:  132 Cycles
Iteration 20:  132 Cycles

Please, press any key to end the application...


Regards,

Steve N.

Gunther

Steve,

Quote from: FORTRANS on May 25, 2022, 04:27:19 AM
   I booted MS-DOS on this machine, and the program crashed.
So, good call.

yes, that has to do with the RDTSCP instruction, which isn't supported by your P-III. I'm working on an improved version that checks this separately.
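
Such a separate check usually goes through CPUID: the Time Stamp Counter is reported in leaf 1, EDX bit 4, and RDTSCP in extended leaf 80000001h, EDX bit 27. The Python sketch below only shows how the bits would be decoded. It is not the code of the improved version, the function name is hypothetical, and the register values in the usage line are made up for illustration rather than read from a real P-III.

```python
TSC_BIT = 4               # CPUID leaf 1, EDX bit 4: Time Stamp Counter
RDTSCP_BIT = 27           # CPUID leaf 80000001h, EDX bit 27: RDTSCP
FIRST_FEATURE_EXT_LEAF = 0x80000001


def timing_features(leaf1_edx, max_ext_leaf, ext1_edx):
    """Decode TSC and RDTSCP support from raw CPUID register values.

    leaf1_edx    : EDX returned by CPUID with EAX = 1
    max_ext_leaf : EAX returned by CPUID with EAX = 80000000h
    ext1_edx     : EDX returned by CPUID with EAX = 80000001h
                   (ignored if that leaf is not available)
    """
    has_tsc = bool((leaf1_edx >> TSC_BIT) & 1)
    has_ext_leaf = max_ext_leaf >= FIRST_FEATURE_EXT_LEAF
    has_rdtscp = has_ext_leaf and bool((ext1_edx >> RDTSCP_BIT) & 1)
    return {"rdtsc": has_tsc, "rdtscp": has_rdtscp}


# Illustrative values only: a CPU that reports TSC but no extended leaves.
print(timing_features(leaf1_edx=1 << TSC_BIT, max_ext_leaf=0, ext1_edx=0))
```

With a result like this the program could fall back to a different measuring sequence instead of crashing on the unsupported instruction.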

Quote from: FORTRANS on May 25, 2022, 04:27:19 AM
   To answer a vague question, I used FTP to put the program
into the VirtualBox on my Windows 8.1 laptop.  And this time
Gunther's program ran in a VDM.

Yes, it's a different CPU.

Thank you for your report.