Hello everyone,
I'm facing a technical challenge related to CPU cycle counting in game development. My current setup uses RDTSC to read the system's main clock, which is crucial for achieving precise timing in games. However, I'm encountering synchronization problems when Intel's Turbo Boost feature is active. Disabling Turbo Boost in the BIOS seems to resolve the issue, but it negatively impacts the base clock speed, which is not ideal for gaming performance.
To address this, I considered overclocking the BCLK to maintain high frame rates, but I'd prefer a solution that allows accurate RDTSC counting even with Turbo Boost on. Here's how I'm currently reading the CPU frequency, which is only effective with Turbo Boost off:
typedef struct _PROCESSOR_POWER_INFORMATION {
    ULONG Number;
    ULONG MaxMhz;
    ULONG CurrentMhz;
    ULONG MhzLimit;
    ULONG MaxIdleState;
    ULONG CurrentIdleState;
} PROCESSOR_POWER_INFORMATION, * PPROCESSOR_POWER_INFORMATION;
typedef struct _SYSTEM_BASIC_INFORMATION
{
    ULONG Reserved;
    ULONG TimerResolution;
    ULONG PageSize;
    ULONG NumberOfPhysicalPages;
    ULONG LowestPhysicalPageNumber;
    ULONG HighestPhysicalPageNumber;
    ULONG AllocationGranularity;
    ULONG_PTR MinimumUserModeAddress;
    ULONG_PTR MaximumUserModeAddress;
    KAFFINITY ActiveProcessorsAffinityMask;
    CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION, * PSYSTEM_BASIC_INFORMATION;
#define MHZ_IN_HZ 1000000
#define BTAG 'BCmN'
static uint64_t determine_cpu_cycle_rate(void)
{
    NTSTATUS status;
    ULONG num_proc;
    SYSTEM_BASIC_INFORMATION sysinfo;
    PPROCESSOR_POWER_INFORMATION proc_pwr_info;
    uint64_t cpu_cycle_freq;
    // Get number of logical processors
    status = ZwQuerySystemInformation(SystemBasicInformation, &sysinfo, sizeof(sysinfo), NULL);
    if (!NT_SUCCESS(status)) {
        return 0;
    }
    num_proc = sysinfo.NumberOfProcessors;
    // Allocate memory for processor power information
    proc_pwr_info = (PPROCESSOR_POWER_INFORMATION)ExAllocatePoolWithTag(NonPagedPool, num_proc * sizeof(PROCESSOR_POWER_INFORMATION), BTAG);
    if (proc_pwr_info == NULL) {
        return 0;
    }
    // Get CPU information
    status = ZwPowerInformation(ProcessorInformation, NULL, 0, proc_pwr_info, num_proc * sizeof(PROCESSOR_POWER_INFORMATION));
    if (!NT_SUCCESS(status)) {
        ExFreePoolWithTag(proc_pwr_info, BTAG);
        return 0;
    }
    // Convert the current clock rate of processor 0 from MHz to Hz.
    // Widen to 64 bits first: a 32-bit ULONG product overflows above ~4294 MHz.
    cpu_cycle_freq = (uint64_t)proc_pwr_info[0].CurrentMhz * MHZ_IN_HZ;
    // Free the allocated memory
    ExFreePoolWithTag(proc_pwr_info, BTAG);
    return cpu_cycle_freq;
}
And this is my MASM code for reading RDTSC, using lfence to handle out-of-order execution issues on Intel platforms:
.code
TSCGet proc
    lfence
    rdtsc
    shl rdx, 32
    or rax, rdx     ; RAX = (RDX << 32) | EAX, so the 64-bit count is returned in RAX
    ret
TSCGet endp
end
Now my question is: how can I accurately count CPU cycles using RDTSC in a gaming environment with Turbo Boost enabled? Is it possible? Thanks for the help!
Hi v0xX,
You probably know that rdtsc is a can of worms. Microsoft discourages its use and recommends QueryPerformanceCounter instead, which I use in my NanoTimer (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1171) macro. It is pretty reliable but counts micro- or milliseconds, not cycles.
Another macro, CyCt* (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1479), uses a statistical approach to achieve more precise cycle counts, i.e. it cuts off outliers. Here is a typical plot for three different algos:
(http://www.jj2007.eu/pics/CyCtPlot.png)
To understand your problem better, could you tell us in a few words how exactly you are using your cycle count, and why precision is important for you?
I attach the plot program. Just press arrow right to refresh the timings, it's hilarious what Windows is doing behind the scenes ;-)
P.S.: Welcome to the forum :thup:
Hi
The crux of my implementation is converting these cycles into time measured in seconds. I do this by dividing the cycle count by the CPU's frequency. This division yields two values: time_seconds (the whole seconds) and time_fraction (the fractional second part). To avoid precision loss due to very high cycle counts, I've implemented a mechanism where the timer wraps more frequently (every week). This ensures that the fractional part of the time remains accurate. This method is essential for my gaming setup, where timing accuracy down to microseconds is critical. And that's how I read rdtsc.
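The conversion described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the poster's actual code: the names tsc_time_t and cycles_to_time are made up for the example, the fraction is expressed in nanoseconds, and the TSC rate is assumed to be constant and already known. Using the integer remainder keeps the fractional part exact to the nanosecond regardless of how large the cycle count gets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: whole seconds plus a nanosecond fraction. */
typedef struct {
    uint64_t seconds;      /* whole seconds */
    uint64_t nanoseconds;  /* fractional part, 0..999999999 */
} tsc_time_t;

/* Split a cycle count into seconds and a fraction.
 * tsc_hz is the TSC rate in Hz (assumed constant, i.e. invariant TSC). */
static tsc_time_t cycles_to_time(uint64_t cycles, uint64_t tsc_hz)
{
    tsc_time_t t;
    /* The remainder below is < tsc_hz, so remainder * 1e9 fits in
     * 64 bits as long as tsc_hz stays below about 18.4 GHz. */
    assert(tsc_hz != 0 && tsc_hz <= UINT64_MAX / 1000000000ull);
    t.seconds = cycles / tsc_hz;
    t.nanoseconds = (cycles % tsc_hz) * 1000000000ull / tsc_hz;
    return t;
}
```

For example, 7,500,000,000 cycles at an assumed 3 GHz come out as 2 seconds and 500,000,000 nanoseconds.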
Hi v0xX
In your case, using the CPU RDTSC has some disadvantages. A better way is to use the high frequency interrupt counter. Some APIs use it, for example GetSystemTimePreciseAsFileTime (https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getsystemtimepreciseasfiletime).
Check this https://devblogs.microsoft.com/oldnewthing/20170921-00/?p=97057 (https://devblogs.microsoft.com/oldnewthing/20170921-00/?p=97057).
Another possibility is to read the shared kernel memory at 7FFE0000h, which holds the tick count with an accuracy of about 100ns.
Biterider
Quote from: Biterider on November 22, 2023, 04:14:55 AM
Another possibility is to read the shared kernel memory at 7FFE0000h, which holds the tick count with an accuracy of about 100ns.
Where can I find that variable? I made a quick test, after reading Geoff Chappell's page (https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/ntexapi_x/kuser_shared_data/index.htm), and it looks like there are the usual GetTickCount resolutions:
include \masm32\MasmBasic\MasmBasic.inc
Init
mov esi, 7FFE0000h
PrintLine HexDump$(esi, 64)
xor ecx, ecx
.Repeat
PrintLine Hex$([esi+8]), " ", Hex$([esi+14h]), " ", Hex$(rv(GetTickCount))
push 999999*2 ; tiny delay
.Repeat
dec stack
.Until Sign?
pop edx
inc ecx
.Until ecx>30
EndOfCode
Output:
7FFE0000 00 00 00 00 00 00 A0 0F 51 FE 53 51 18 05 00 00 .......QSQ....
7FFE0010 18 05 00 00 89 23 B9 EC B5 1C DA 01 B5 1C DA 01 ....#....
7FFE0020 00 98 3B 9E F7 FF FF FF F7 FF FF FF 64 86 64 86 .;dd
7FFE0030 43 00 3A 00 5C 00 57 00 69 00 6E 00 64 00 6F 00 C.:.\.W.i.n.d.o.
5153FE51 ECB92389 2163F9EB
51570548 ECBC2A92 2163F9FB
51570548 ECBC2A92 2163F9FB
5158F1F9 ECBE174E 2163FA0B
5159B54C ECBEDAA5 2163FA1A
515A5186 ECBF76E3 2163FA1A
515B99D3 ECC0BF37 2163FA1A
515B99D3 ECC0BF37 2163FA1A
515B99D3 ECC0BF37 2163FA1A
515E338C ECC358FF 2163FA2A
515ECFC6 ECC3F53D 2163FA3A
515F9322 ECC4B89D 2163FA3A
51607D91 ECC5A311 2163FA3A
5160D5E2 ECC5FB64 2163FA3A
5160D5E2 ECC5FB64 2163FA3A
5162A0F5 ECC7C682 2163FA49
5162A0F5 ECC7C682 2163FA49
5162A0F5 ECC7C682 2163FA49
5165034B ECCA28E6 2163FA59
51664D5B ECCB72FD 2163FA68
51664D5B ECCB72FD 2163FA68
51664D5B ECCB72FD 2163FA68
5168775E ECCD9D0C 2163FA78
5168775E ECCD9D0C 2163FA78
5168775E ECCD9D0C 2163FA78
516AF522 ECD01ADF 2163FA88
516AF522 ECD01ADF 2163FA88
516AF522 ECD01ADF 2163FA88
516D4B43 ECD2710D 2163FA97
516E0E7A ECD33449 2163FA97
516EAACB ECD3D09D 2163FA97
As you can see, the two counters at +8 and +14h are not exactly synchronised between them and with GetTickCount's low dword, but despite the tiny delay, they don't get updated between two or three loop iterations. With 999999*1, you get 5 to 10 identical readings. It's the standard 16ms resolution of GetTickCount, unfortunately :sad:
You can read it through MASM. This one is a 64-bit GetTickCount that works in both kernel mode and user mode:
.code
MyGetTickCount64Kernel32 proc
mov ecx, dword ptr [7FFE0004h]
mov eax, 7FFE0320h
shl rcx, 20h
mov rax, qword ptr [rax]
shl rax, 8
mul rcx
mov rax, rdx
ret
MyGetTickCount64Kernel32 endp
end
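To make the arithmetic in the routine above easier to follow, here is a rough C equivalent. It is a sketch, not a drop-in replacement: the function and macro names are invented for this example, and it takes the two values as parameters instead of reading the shared page, so it runs anywhere. The routine computes ((TickCount << 8) * (Multiplier << 32)) >> 64, which reduces to (TickCount * Multiplier) >> 24. The multiplier 0FA00000h visible at offset +4 in the hex dump earlier in the thread corresponds to 15.625 ms per tick, which matches the roughly 16 ms steps in the output.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets inside the shared page at 7FFE0000h, as used by the MASM routine.
 * The macro names are made up for this example. */
#define KUSD_TICK_COUNT_MULTIPLIER_OFFSET 0x004u
#define KUSD_TICK_COUNT_OFFSET            0x320u

/* Convert a raw tick count to milliseconds.
 * multiplier is a fixed-point factor with 24 fractional bits. */
static uint64_t tick_count_to_ms(uint64_t tick_count, uint32_t multiplier)
{
    /* Assumes the 64-bit product does not overflow; the MASM version
     * avoids this limit by using the full 128-bit result of MUL. */
    assert(multiplier == 0 || tick_count <= UINT64_MAX / multiplier);
    return (tick_count * multiplier) >> 24;
}
```

With the multiplier from the dump, 64 ticks give exactly 1000 ms.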
Sure, but it's the same pattern:
xor ebx, ebx
@@:
PrintLine Hex$(rv(MyGetTickCount64Kernel32)), Tb$, Hex$(rv(GetTickCount))
xor ecx, ecx
x1: inc ecx
cmp ecx, 999999
jb x1
inc ebx
cmp ebx, 20
jb @B
Output:
2404592 21804592
2404592 21804592
2404592 21804592
2404592 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045b2
24045b2 218045b2
24045b2 218045b2
24045b2 218045b2
24045b2 218045b2
24045b2 218045b2
Standard GetTickCount resolution, unusable for your project. Try QueryPerformanceCounter; Micros*t says it delivers timestamps with sub-microsecond resolution, no matter what tricks your CPU is playing on you (throttling, turbo, ...).
Quote from: jj2007 on November 22, 2023, 07:35:57 AM
Sure, but it's the same pattern:
[...]
Standard GetTickCount resolution, unusable for your project. Try QueryPerformanceCounter; Micros*t says it delivers timestamps with sub-microsecond resolution, no matter what tricks your CPU is playing on you (throttling, turbo, ...).
I've been using ZwQueryPerformanceCounter, which functions similarly to QueryPerformanceCounter. However, I'm exploring the rdtsc instruction for its superior precision.
My main challenge is dealing with the CPU's Turbo Boost feature, which seems to affect the accuracy of rdtsc. I suspect that reading the base frequency directly might bypass the Turbo Boost fluctuations. Here's the approach I'm considering:
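/* Despite its name, this returns the frequency in Hz, not MHz. */
uint64_t tsc_freq_mhz(void) {
    /* MSR 0xCE (MSR_PLATFORM_INFO): bits 15:8 hold the maximum non-turbo ratio. */
    msr_t msr = rdmsr(0xce);
    /* Widen before multiplying so the result cannot overflow a 32-bit unsigned long. */
    return (uint64_t)BASE_CLOCK_MHZ * ((msr.lo >> 8) & 0xff) * 1000000;
}
With this code, I aim to calculate the TSC frequency from the MSR (Model Specific Register), but I'm still not sure whether it will fluctuate.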
Hi v0xX
Since Nehalem, the CPUs support the invariant TSC.
From Intel® 64 and IA-32 Architectures Software Developer's Manual (https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html#:~:text=Intel%C2%AE%2064%20and%20IA-32%20Architectures%20Software%20Developer's%20Manual%20Combined%20Volumes%3A%201%2C%202A%2C%202B%2C%202C%2C%202D%2C%203A%2C%203B%2C%203C%2C%203D%2C%20and%204)
Quote
18.17.1 Invariant TSC
The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
and
Quote
18.17.4 Invariant Time-Keeping
The invariant TSC is based on the invariant timekeeping hardware (called Always Running Timer or ART), that runs at the core crystal clock frequency. The ratio defined by CPUID leaf 15H expresses the frequency relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity relationship holds between TSC and the ART hardware:
TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
where 'K' is an offset that can be adjusted by a privileged agent. When ART hardware is reset, both invariant TSC and K are also reset.
Biterider
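The two checks quoted from the manual can be sketched in C like this. Treat it as an illustration under assumptions: the function names are made up, and the register values are passed in as parameters so the example stays portable (on x86 they would come from the compiler's CPUID intrinsic). The crystal frequency is assumed to be known; CPUID.15H:ECX reports it on some processors, but that detail is not part of the quoted text.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID.80000007H:EDX bit 8 indicates invariant TSC support. */
static int has_invariant_tsc(uint32_t cpuid_80000007_edx)
{
    return (int)((cpuid_80000007_edx >> 8) & 1u);
}

/* TSC rate from the CPUID leaf 15H ratio:
 *   tsc_hz = crystal_hz * EBX / EAX
 * Returns 0 when the leaf does not enumerate the ratio. */
static uint64_t tsc_hz_from_leaf15(uint32_t eax_denominator,
                                   uint32_t ebx_numerator,
                                   uint64_t crystal_hz)
{
    if (eax_denominator == 0 || ebx_numerator == 0)
        return 0;
    /* Guard against overflow of the 64-bit product below. */
    assert(crystal_hz <= UINT64_MAX / UINT32_MAX);
    return crystal_hz * ebx_numerator / eax_denominator;
}
```

As a worked example, an assumed 24 MHz crystal with EBX = 250 and EAX = 2 gives 24,000,000 * 250 / 2 = 3,000,000,000 Hz.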
Quote from: v0xX on November 21, 2023, 05:17:08 PM
How can I accurately count CPU cycles using RDTSC in a gaming environment with Turbo Boost enabled is it possible ?
I'm not a gamer, so pardon me if what I write looks silly:
- are you interested in cycle counts, irrespectively of cpu speed (turbo boost, throttling)? In that case, rdtsc preceded by cpuid would be the way to go (beware of core switches, beware of interrupts);
- or do you want time, i.e. nanoseconds between two events? Then QueryPerformance* is probably the best way to go (no idea how it treats interrupts, though).
In any case, you urgently need a testbed.
Quote from: jj2007 on November 22, 2023, 09:18:29 AM
Quote from: v0xX on November 21, 2023, 05:17:08 PM
How can I accurately count CPU cycles using RDTSC in a gaming environment with Turbo Boost enabled is it possible ?
[...]
Thank you for your suggestions and insights. My focus is indeed on measuring cycle counts, and for this, I'm using the rdtsc instruction preceded by cpuid to ensure accuracy in my measurements. I am also aware of the potential issues related to core switches and interrupts, and I'll keep these in mind as I proceed. Thanks!
Quote from: v0xX on November 22, 2023, 09:37:24 AM
My focus is indeed on measuring cycle counts, and for this, I'm using the rdtsc instruction preceded by cpuid
Ok. In that case, consider a "statistical" approach, i.e. take a hundred readings and eliminate the slowest 10% - see the image above in reply #1. Otherwise you will never get consistent results from rdtsc, mainly due to context switches.
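The "statistical" approach suggested above could look roughly like this in C. It is a minimal sketch, not the CyCt macro's actual algorithm: the function name and the drop_percent parameter are made up for the example, and the samples are assumed to be cycle deltas already collected from repeated runs of the code under test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint64_t, ascending. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the samples in place, discard the slowest (largest) drop_percent
 * of them, and return the mean of the rest.
 * Assumes the sum of the kept samples fits in 64 bits. */
static uint64_t trimmed_mean_cycles(uint64_t *samples, size_t count,
                                    unsigned drop_percent)
{
    size_t keep, i;
    uint64_t sum = 0;
    assert(samples != NULL && count > 0 && drop_percent < 100);
    qsort(samples, count, sizeof samples[0], cmp_u64);
    keep = count - (count * drop_percent) / 100;  /* always at least 1 */
    for (i = 0; i < keep; i++)
        sum += samples[i];
    return sum / keep;
}
```

With ten readings around 100 cycles plus one 5000-cycle outlier (say, from a context switch), dropping the slowest 10% removes the outlier and the mean stays at about 100.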
I figured out it has nothing to do with Turbo Boost. I've been working on addressing the issue of pipeline unsynchronization when reading the Time Stamp Counter (TSC) on our system. Initially, I tried two different methods, but they both resulted in inconsistent cycle counts. First Method - TSCGet with a single LFENCE:
TSCGet proc
lfence
rdtsc
shl rdx, 32
or rax, rdx
ret
TSCGet endp
In this approach, I used LFENCE (Load Fence) followed by RDTSC (Read Time Stamp Counter). LFENCE was intended to act as a barrier for earlier reads, but it didn't provide full serialization, leading to some inconsistencies in the TSC reading due to out-of-order execution.
Second Method - TSCGet with Double LFENCE:
TSCGet proc
lfence
rdtsc
lfence
shl rdx, 32
or rax, rdx
ret
TSCGet endp
Here, I added an additional LFENCE after RDTSC. While this helped in mitigating some out-of-order execution issues, it still wasn't completely reliable for our needs. Finally, I developed a more robust solution that effectively fixed the pipeline synchronization issue:
align 16
TSCGet proc
push rbx
sub eax, eax
cpuid
rdtsc
shl rdx, 32
or rax, rdx
push rax
sub eax, eax
cpuid
pop rax
pop rbx
ret
TSCGet endp
end
This method involves using the CPUID instruction before and after RDTSC. The CPUID instruction acts as a complete serializing instruction, ensuring that every instruction preceding it is fully complete and the instruction pipeline is flushed. This guarantees that when RDTSC is called, it's done in a stable and synchronized state. By incorporating CPUID, I was able to get consistent and accurate readings of the TSC, resolving the pipeline synchronization issues we were facing. Thanks for the help, guys!
The RDTSC rate is only affected by base clock (BCLK) overclocking and usually matches the base frequency.
On Intel it's available through CPUID. AMD has no straightforward means of finding the frequency.
Quote from: InfiniteLoop on November 24, 2023, 04:54:04 PM
RDTSC is only affected by base frequency overclocking and is usually the same as the base clock.
On Intel its inside CPUID. AMD has no straight-forward means of finding the frequency.
On AMD you can read the frequency through SystemHypervisorSharedPageInformation:
static inline uint64_t get_rdtsc_freq(void) {
    static uint64_t tsc_freq = 0;   // cached after the first successful query
    volatile uint64_t* hypervisor_shared_page = NULL;
    ULONG size = 0;
    NTSTATUS result;
    if (tsc_freq != 0)
        return tsc_freq;
    // SystemHypervisorSharedPageInformation == 0xc5
    result = ZwQuerySystemInformation(SystemHypervisorSharedPageInformation, (void*)&hypervisor_shared_page, sizeof(hypervisor_shared_page), &size);
    // success, and the scale is usable as a divisor
    if (result >= 0 && size == sizeof(hypervisor_shared_page) && hypervisor_shared_page != NULL
        && (hypervisor_shared_page[1] >> 32) != 0) {
        // docs say ReferenceTime = ((VirtualTsc * TscScale) >> 64)
        // set ReferenceTime = 10000000 = 1 second @ 10MHz, solve for VirtualTsc
        // => VirtualTsc = (10000000 << 64) / TscScale
        // 64-bit approximation: shift numerator and denominator right by 32
        tsc_freq = (10000000ull << 32) / (hypervisor_shared_page[1] >> 32);
        // If your build configuration supports 128 bit arithmetic, do this:
        // tsc_freq = ((unsigned __int128)10000000ull << 64) / hypervisor_shared_page[1];
    }
    return tsc_freq;   // 0 if the query failed
}
As far as I understand, no CPU can read its own frequency.
What the CPU does is read a clock, but the result depends on both the frequency and the clock.
So, at best, it's only possible to obtain an estimate from two reads of the clock. And because the result looks random, it must be measured several times.
No?
A simple library to sample the frequency on single CPU cores. The frequency is sampled by computing the ratio of actual performed cycles to the cycles that have passed in base frequency according to rdtsc.
https://github.com/intel/intel-cpu-frequency-library (https://github.com/intel/intel-cpu-frequency-library)
New CPUs have a "constant timestamp counter frequency" feature. This means that the timer queried by the rdtsc instruction doesn't change its frequency when CPU cores are overclocked or downclocked by Turbo Boost. It also means that you cannot detect the current CPU frequency by comparing rdtsc progress to HPET progress.
Monitoring TurboBoost frequency tool
https://github.com/shimada-k/turbofreq (https://github.com/shimada-k/turbofreq)
Acquiring high-resolution time stamps
https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps (https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps)
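The ratio idea described for the frequency-sampling library above boils down to one line of arithmetic. The sketch below only illustrates that arithmetic; it is not the library's code. The function name is made up, and how the "actually performed" cycle count is obtained (for instance from a hardware performance counter) is left open, since the thread does not say.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the effective core frequency over a sampling interval.
 *   actual_cycles: core cycles really executed during the interval
 *   tsc_cycles:    TSC ticks elapsed during the same interval
 *   tsc_hz:        constant TSC rate (assumed known) */
static double effective_core_hz(uint64_t actual_cycles, uint64_t tsc_cycles,
                                double tsc_hz)
{
    assert(tsc_cycles != 0);
    return (double)actual_cycles / (double)tsc_cycles * tsc_hz;
}
```

If a core executed 4500 cycles while the TSC advanced by 3000 ticks at an assumed 3 GHz, the estimate is 4.5 GHz, i.e. the core was boosting during that interval.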
When running old games on my new laptop, I use a power setting with turbo disabled, because the old games are made to run as fast as possible.
Wouldn't an alternative be to measure with and without turbo and compare?
Quote from: v0xX on November 25, 2023, 08:54:48 PM
on amd u can read the freq through SystemHypervisorSharedPageInformation
That code doesn't make sense. Hmm let me adjust it.
No joy. It's zero.
// SystemHypervisorSharedPageInformation = 0xC5 = 197
typedef NTSTATUS (NTAPI *QuerySystemInformationFn)(SYSTEM_INFORMATION_CLASS, PVOID, ULONG, PULONG);
QuerySystemInformationFn GetInfo = nullptr;
HMODULE hh = LoadLibraryA("Ntdll.dll");
if (hh != nullptr) // LoadLibraryA returns NULL on failure, not INVALID_HANDLE_VALUE
{
    GetInfo = (QuerySystemInformationFn)GetProcAddress(hh, "ZwQuerySystemInformation");
}
else
    std::cout << "bad handle \r\n";
if (GetInfo == nullptr)
{
    std::cout << "bad address\r\n";
}
else
{
    unsigned long long* info = nullptr;
    ULONG l = 0;
    NTSTATUS t;
    // Size probe: ask for the required length with an empty buffer.
    t = GetInfo((SYSTEM_INFORMATION_CLASS)197, nullptr, 0, &l);
    if (l == 0)
        std::cout << t << " error A\r\n";
    else
    {
        info = (unsigned long long*)_aligned_malloc(8 * (size_t)l, 64);
        if (info != nullptr)
        {
            for (ULONG i = 0; i < l; i++)
                info[i] = 0;
            t = GetInfo((SYSTEM_INFORMATION_CLASS)197, info, l, nullptr);
            if (t < 0) // NTSTATUS is signed; failure codes are negative
                std::cout << t << " error B\r\n";
            else
            {
                unsigned long long tsc = info[1];
                std::cout << "TSC Frequ " << tsc << "\r\n";
            }
            _aligned_free(info);
        }
    }
}
if (hh != nullptr)
    FreeLibrary(hh); // module handles are released with FreeLibrary, not CloseHandle
After some research, here is a little fix for it :)
- disable intelppm.sys;
- disable performance boost mode;
- enable SpeedStep/Turbo in the BIOS (intelppm does not use the turbo of SpeedStep, only Speed Shift uses it);
- then set Attributes to 2 on the key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Power\PowerSettings\54533251-82be-4824-96c1-47b60b740d00\be337238-0d82-4146-a960-4f3749d470c7.
With that, rdtsc cycles correctly with turbo, without any issue :skrewy: