The MASM Forum

General => The Laboratory => Topic started by: v0xX on November 21, 2023, 05:17:08 PM

Title: rdtsc turbo mode
Post by: v0xX on November 21, 2023, 05:17:08 PM
Hello everyone,

I'm facing a technical challenge related to CPU cycle counting in game development. My current setup uses RDTSC to read the system's main clock, which is crucial for achieving precise timing in games. However, I'm encountering synchronization problems when Intel's Turbo Boost feature is active. Disabling Turbo Boost in the BIOS seems to resolve the issue, but it negatively impacts the base clock speed, which is not ideal for gaming performance.

To address this, I considered overclocking the BCLK to maintain high frame rates, but I'd prefer a solution that allows accurate RDTSC counting even with Turbo Boost on. Here's how I'm currently reading the CPU frequency, which is only effective with Turbo Boost off:

typedef struct _PROCESSOR_POWER_INFORMATION {
ULONG Number;
ULONG MaxMhz;
ULONG CurrentMhz;
ULONG MhzLimit;
ULONG MaxIdleState;
ULONG CurrentIdleState;
} PROCESSOR_POWER_INFORMATION, * PPROCESSOR_POWER_INFORMATION;

typedef struct _SYSTEM_BASIC_INFORMATION
{
ULONG Reserved;
ULONG TimerResolution;
ULONG PageSize;
ULONG NumberOfPhysicalPages;
ULONG LowestPhysicalPageNumber;
ULONG HighestPhysicalPageNumber;
ULONG AllocationGranularity;
ULONG_PTR MinimumUserModeAddress;
ULONG_PTR MaximumUserModeAddress;
KAFFINITY ActiveProcessorsAffinityMask;
CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION, * PSYSTEM_BASIC_INFORMATION;

#define MHZ_IN_HZ       1000000
#define BTAG 'BCmN'

static uint64_t determine_cpu_cycle_rate(void)
{
NTSTATUS status;
ULONG num_proc;
SYSTEM_BASIC_INFORMATION sysinfo;
PPROCESSOR_POWER_INFORMATION proc_pwr_info;
uint64_t cpu_cycle_freq;

// Get number of logical processors
status = ZwQuerySystemInformation(SystemBasicInformation, &sysinfo, sizeof(sysinfo), NULL);
if (!NT_SUCCESS(status)) {
return 0;
}
num_proc = sysinfo.NumberOfProcessors;

// Allocate memory for processor power information
proc_pwr_info = (PPROCESSOR_POWER_INFORMATION)ExAllocatePoolWithTag(NonPagedPool, num_proc * sizeof(PROCESSOR_POWER_INFORMATION), BTAG);
if (proc_pwr_info == NULL) {
return 0;
}

// Get CPU information
status = ZwPowerInformation(ProcessorInformation, NULL, 0, proc_pwr_info, num_proc * sizeof(PROCESSOR_POWER_INFORMATION));
if (!NT_SUCCESS(status)) {
ExFreePoolWithTag(proc_pwr_info, BTAG);
return 0;
}

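// Compute the current clock rate in Hz; widen to 64 bits before multiplying,
// since MHz * 1000000 overflows a 32-bit ULONG above about 4294 MHz
cpu_cycle_freq = (uint64_t)proc_pwr_info[0].CurrentMhz * MHZ_IN_HZ;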

// Free the allocated memory
ExFreePoolWithTag(proc_pwr_info, BTAG);

return cpu_cycle_freq;
}
And this is my MASM code for reading RDTSC, using lfence to handle out-of-order execution issues on Intel platforms:

.code

TSCGet proc
lfence
rdtsc
shl rdx,32
or rax, rdx ; combine both halves in rax, the 64-bit return register
ret
TSCGet endp

end
Now my question is: how can I accurately count CPU cycles using RDTSC in a gaming environment with Turbo Boost enabled? Is it possible? Thanks for the help!
Title: Re: rdtsc turbo mode
Post by: jj2007 on November 21, 2023, 09:05:09 PM
Hi v0xX,

You probably know that rdtsc is a can of worms. Microsoft discourages its use and recommends QueryPerformanceCounter instead, which I use in my NanoTimer (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1171) macro. It is pretty reliable but counts micro- or milliseconds, not cycles.

Another macro, CyCt* (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1479), uses a statistical approach to achieve more precise cycle counts, i.e. it cuts off outliers. Here is a typical plot for three different algos:

(http://www.jj2007.eu/pics/CyCtPlot.png)

To understand your problem better, could you tell us in a few words how exactly you are using your cycle count, and why precision is important for you?

I attach the plot program. Just press arrow right to refresh the timings, it's hilarious what Windows is doing behind the scenes ;-)

P.S.: Welcome to the forum :thup:
Title: Re: rdtsc turbo mode
Post by: v0xX on November 22, 2023, 02:33:46 AM
Hi

The crux of my implementation is converting these cycles into time measured in seconds. I do this by dividing the cycle count by the CPU's frequency. This division yields two values: time_seconds (the whole seconds) and time_fraction (the fractional part of a second). To avoid precision loss due to very high cycle counts, I've implemented a mechanism where the timer wraps more frequently (every week). This ensures that the fractional part of the time remains accurate. This method is essential for my gaming use, where timing accuracy down to microseconds is critical. And that's how I read rdtsc.
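
A minimal sketch of the conversion described above, assuming the TSC rate is constant and already known in Hz. The names split_time_t and cycles_to_time are hypothetical, and the fraction is expressed in nanoseconds rather than as a floating-point value, which is only one possible choice:

```c
#include <assert.h>
#include <stdint.h>

/* Whole seconds plus the fractional part expressed in nanoseconds. */
typedef struct {
    uint64_t seconds;
    uint32_t nanoseconds;
} split_time_t;

/* Split a TSC cycle count into seconds and nanoseconds using integer math only.
 * tsc_hz is the (assumed constant) TSC frequency in Hz. */
static split_time_t cycles_to_time(uint64_t cycles, uint64_t tsc_hz)
{
    split_time_t t;
    assert(tsc_hz != 0);
    t.seconds = cycles / tsc_hz;
    /* The remainder is below tsc_hz, so remainder * 1e9 fits in 64 bits
     * for any frequency under roughly 18 GHz. */
    t.nanoseconds = (uint32_t)((cycles % tsc_hz) * 1000000000ull / tsc_hz);
    return t;
}
```

Because the whole seconds are divided out before the fraction is scaled, the fractional part keeps its precision no matter how large the cycle count grows, so with this approach a periodic wrap would not be needed for accuracy.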
Title: Re: rdtsc turbo mode
Post by: Biterider on November 22, 2023, 04:14:55 AM
Hi v0xX
In your case, using the CPU RDTSC has some disadvantages. A better way is to use the high frequency interrupt counter. Some APIs use it, for example GetSystemTimePreciseAsFileTime (https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getsystemtimepreciseasfiletime).
Check this: https://devblogs.microsoft.com/oldnewthing/20170921-00/?p=97057

Another possibility is to read the shared kernel memory at 7FFE0000h, which holds the tick count with an accuracy of about 100ns.

Biterider
Title: Re: rdtsc turbo mode
Post by: jj2007 on November 22, 2023, 07:05:07 AM
Quote from: Biterider on November 22, 2023, 04:14:55 AM
Another possibility is to read the shared kernel memory at 7FFE0000h, which holds the tick count with an accuracy of about 100ns.

Where can I find that variable? I made a quick test, after reading Geoff Chappell's page (https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/ntexapi_x/kuser_shared_data/index.htm), and it looks like there are the usual GetTickCount resolutions:

include \masm32\MasmBasic\MasmBasic.inc
  Init
  mov esi, 7FFE0000h
  PrintLine HexDump$(esi, 64)
  xor ecx, ecx
  .Repeat
    PrintLine Hex$([esi+8]), " ", Hex$([esi+14h]), " ", Hex$(rv(GetTickCount))
    push 999999*2  ; tiny delay
    .Repeat
        dec stack
    .Until Sign?
    pop edx
    inc ecx
  .Until ecx>30
EndOfCode

Output:
7FFE0000  00 00 00 00 00 00 A0 0F 51 FE 53 51 18 05 00 00 .......QSQ....
7FFE0010  18 05 00 00 89 23 B9 EC B5 1C DA 01 B5 1C DA 01 ....#....
7FFE0020  00 98 3B 9E F7 FF FF FF F7 FF FF FF 64 86 64 86 .;dd
7FFE0030  43 00 3A 00 5C 00 57 00 69 00 6E 00 64 00 6F 00 C.:.\.W.i.n.d.o.

5153FE51 ECB92389 2163F9EB
51570548 ECBC2A92 2163F9FB
51570548 ECBC2A92 2163F9FB
5158F1F9 ECBE174E 2163FA0B
5159B54C ECBEDAA5 2163FA1A
515A5186 ECBF76E3 2163FA1A
515B99D3 ECC0BF37 2163FA1A
515B99D3 ECC0BF37 2163FA1A
515B99D3 ECC0BF37 2163FA1A
515E338C ECC358FF 2163FA2A
515ECFC6 ECC3F53D 2163FA3A
515F9322 ECC4B89D 2163FA3A
51607D91 ECC5A311 2163FA3A
5160D5E2 ECC5FB64 2163FA3A
5160D5E2 ECC5FB64 2163FA3A
5162A0F5 ECC7C682 2163FA49
5162A0F5 ECC7C682 2163FA49
5162A0F5 ECC7C682 2163FA49
5165034B ECCA28E6 2163FA59
51664D5B ECCB72FD 2163FA68
51664D5B ECCB72FD 2163FA68
51664D5B ECCB72FD 2163FA68
5168775E ECCD9D0C 2163FA78
5168775E ECCD9D0C 2163FA78
5168775E ECCD9D0C 2163FA78
516AF522 ECD01ADF 2163FA88
516AF522 ECD01ADF 2163FA88
516AF522 ECD01ADF 2163FA88
516D4B43 ECD2710D 2163FA97
516E0E7A ECD33449 2163FA97
516EAACB ECD3D09D 2163FA97

As you can see, the two counters at +8 and +14h are not exactly synchronised between them and with GetTickCount's low dword, but despite the tiny delay, they don't get updated between two or three loop iterations. With 999999*1, you get 5 to 10 identical readings. It's the standard 16ms resolution of GetTickCount, unfortunately :sad:
Title: Re: rdtsc turbo mode
Post by: v0xX on November 22, 2023, 07:18:09 AM
You can read it from MASM. This one is a 64-bit GetTickCount that works in both kernel and user mode:


.code
MyGetTickCount64Kernel32 proc
                mov     ecx, dword ptr [7FFE0004h]
                mov     eax, 7FFE0320h
                shl     rcx, 20h
                mov     rax, qword ptr [rax]
                shl     rax, 8
                mul     rcx
                mov     rax, rdx
                ret
MyGetTickCount64Kernel32 endp
end
Title: Re: rdtsc turbo mode
Post by: jj2007 on November 22, 2023, 07:35:57 AM
Sure, but it's the same pattern:
  xor ebx, ebx
@@:
  PrintLine Hex$(rv(MyGetTickCount64Kernel32)), Tb$, Hex$(rv(GetTickCount))
  xor ecx, ecx
x1:    inc ecx
    cmp ecx, 999999
    jb x1
  inc ebx
  cmp ebx, 20
  jb @B

Output:
2404592 21804592
2404592 21804592
2404592 21804592
2404592 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045b2
24045b2 218045b2
24045b2 218045b2
24045b2 218045b2
24045b2 218045b2
24045b2 218045b2

Standard GetTickCount resolution, unusable for your project. Try QueryPerformanceCounter; Micros*t says it provides high-resolution (sub-microsecond) time stamps, no matter what tricks your CPU is playing on you (throttling, turbo, ...).
Title: Re: rdtsc turbo mode
Post by: v0xX on November 22, 2023, 08:50:44 AM
Quote from: jj2007 on November 22, 2023, 07:35:57 AM
Standard GetTickCount resolution, unusable for your project. Try QueryPerformanceCounter; Micros*t says it provides high-resolution (sub-microsecond) time stamps, no matter what tricks your CPU is playing on you (throttling, turbo, ...).

I've been using ZwQueryPerformanceCounter, which functions similarly to QueryPerformanceCounter. However, I'm exploring the rdtsc instruction for its superior precision.

My main challenge is dealing with the CPU's Turbo Boost feature, which seems to affect the accuracy of rdtsc. I suspect that reading the base frequency directly might bypass the Turbo Boost fluctuations. Here's the approach I'm considering:

uint64_t tsc_freq_hz(void) {
    // MSR_PLATFORM_INFO (0xCE): bits 15:8 hold the maximum non-turbo ratio
    msr_t msr = rdmsr(0xce);
    // Widen before multiplying so the result in Hz cannot overflow 32 bits
    return (uint64_t)BASE_CLOCK_MHZ * ((msr.lo >> 8) & 0xff) * 1000000;
}

With this code, I aim to calculate the TSC frequency from the MSR (Model Specific Register), but I'm still not sure whether it will fluctuate.
Title: Re: rdtsc turbo mode
Post by: Biterider on November 22, 2023, 09:17:22 AM
Hi v0xX
Since Nehalem, the CPUs support the invariant TSC

From Intel® 64 and IA-32 Architectures Software Developer's Manual (https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html#:~:text=Intel%C2%AE%2064%20and%20IA-32%20Architectures%20Software%20Developer's%20Manual%20Combined%20Volumes%3A%201%2C%202A%2C%202B%2C%202C%2C%202D%2C%203A%2C%203B%2C%203C%2C%203D%2C%20and%204)

Quote:
18.17.1 Invariant TSC
The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
and
Quote:
18.17.4 Invariant Time-Keeping
The invariant TSC is based on the invariant timekeeping hardware (called Always Running Timer or ART), that runs at the core crystal clock frequency. The ratio defined by CPUID leaf 15H expresses the frequency relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity relationship holds between TSC and the ART hardware:
TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
where 'K' is an offset that can be adjusted by a privileged agent. When ART hardware is reset, both invariant TSC and K are also reset.
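
The ratio in the quoted relationship can be turned into a TSC frequency when the crystal frequency is known. The following is a minimal sketch, assuming the caller has already executed CPUID leaf 15H and passes in the raw register values, with ECX taken as the core crystal clock frequency in Hz (it can be reported as 0 on some processors). The function name tsc_hz_from_cpuid15 is hypothetical:

```c
#include <stdint.h>

/* eax = ratio denominator, ebx = ratio numerator, ecx = crystal clock in Hz,
 * all as returned by CPUID leaf 15H.
 * Returns 0 when the leaf does not enumerate enough information. */
static uint64_t tsc_hz_from_cpuid15(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    if (eax == 0 || ebx == 0 || ecx == 0)
        return 0;
    /* TSC rate = crystal rate * EBX / EAX; a 32x32-bit product fits in 64 bits. */
    return (uint64_t)ecx * ebx / eax;
}
```

For example, a 24 MHz crystal with EBX = 250 and EAX = 2 gives 3,000,000,000 Hz. When the function returns 0, the frequency would have to come from another source, such as a calibration against QueryPerformanceCounter.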

Biterider
Title: Re: rdtsc turbo mode
Post by: jj2007 on November 22, 2023, 09:18:29 AM
Quote from: v0xX on November 21, 2023, 05:17:08 PM
How can I accurately count CPU cycles using RDTSC in a gaming environment with Turbo Boost enabled? Is it possible?

I'm not a gamer, so pardon me if what I write looks silly:

- are you interested in cycle counts, irrespective of CPU speed (turbo boost, throttling)? In that case, rdtsc preceded by cpuid would be the way to go (beware of core switches, beware of interrupts);

- or do you want time, i.e. nanoseconds between two events? Then QueryPerformance* is probably the best way to go (no idea how it treats interrupts, though).

In any case, you urgently need a testbed.
Title: Re: rdtsc turbo mode
Post by: v0xX on November 22, 2023, 09:37:24 AM
Quote from: jj2007 on November 22, 2023, 09:18:29 AM
- are you interested in cycle counts, irrespective of CPU speed (turbo boost, throttling)? In that case, rdtsc preceded by cpuid would be the way to go (beware of core switches, beware of interrupts);

Thank you for your suggestions and insights. My focus is indeed on measuring cycle counts, and for this, I'm using the rdtsc instruction preceded by cpuid to ensure accuracy in my measurements. I am also aware of the potential issues related to core switches and interrupts, and I'll keep these in mind as I proceed. Thanks!
Title: Re: rdtsc turbo mode
Post by: jj2007 on November 22, 2023, 10:32:28 AM
Quote from: v0xX on November 22, 2023, 09:37:24 AM
My focus is indeed on measuring cycle counts, and for this, I'm using the rdtsc instruction preceded by cpuid

Ok. In that case, consider a "statistical" approach, i.e. take a hundred readings and eliminate the slowest 10% - see the image above in reply #1. Otherwise you will never get consistent results from rdtsc, mainly due to context switches.
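
The "statistical" approach can be sketched as a trimmed mean: sort the readings, discard the slowest share, and average the rest. This is only an illustration of the idea, not the actual CyCt* implementation; the names cmp_u64 and trimmed_mean_cycles are hypothetical, and the samples are assumed to be cycle counts already collected with rdtsc:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Ascending comparison for qsort on 64-bit cycle counts. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place, discards the slowest drop_percent of them
 * (the outliers caused by interrupts or context switches), and returns
 * the integer mean of the remaining samples. */
static uint64_t trimmed_mean_cycles(uint64_t *samples, size_t count, unsigned drop_percent)
{
    size_t keep, i;
    uint64_t sum = 0;
    assert(count > 0 && drop_percent < 100);
    qsort(samples, count, sizeof samples[0], cmp_u64);
    keep = count - (count * drop_percent) / 100;
    for (i = 0; i < keep; i++)
        sum += samples[i];
    return sum / keep;
}
```

With 100 readings and drop_percent set to 10, the 90 fastest readings are averaged and a single reading inflated by a context switch no longer distorts the result.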
Title: Re: rdtsc turbo mode
Post by: v0xX on November 22, 2023, 05:23:15 PM
I figured out it has nothing to do with Turbo Boost. I've been working on addressing the issue of pipeline unsynchronization when reading the Time Stamp Counter (TSC) on our system. Initially, I tried two different methods, but they both resulted in inconsistent cycle counts.

TSCGet proc
    lfence
    rdtsc
    shl rdx, 32
    or rax, rdx
    ret
TSCGet endp

In this approach, I used LFENCE (Load Fence) followed by RDTSC (Read Time Stamp Counter). LFENCE was intended to act as a barrier for earlier reads, but it didn't provide full serialization, leading to some inconsistencies in the TSC reading due to out-of-order execution.

Second Method - TSCGet with Double LFENCE:

TSCGet proc
    lfence
    rdtsc
    lfence
    shl rdx, 32
    or rax, rdx
    ret
TSCGet endp
Here, I added an additional LFENCE after RDTSC. While this helped in mitigating some out-of-order execution issues, it still wasn't completely reliable for our needs.

Finally, I developed a more robust solution that effectively fixed the pipeline synchronization issue:

align 16
TSCGet proc
    push rbx
    sub eax, eax
    cpuid
    rdtsc
    shl rdx, 32
    or rax, rdx
    push rax
    sub eax, eax
    cpuid
    pop rax
    pop rbx
    ret
TSCGet endp
end
This method involves using the CPUID instruction before and after RDTSC. The CPUID instruction acts as a complete serializing instruction, ensuring that every instruction preceding it is fully complete and the instruction pipeline is flushed. This guarantees that when RDTSC is called, it's done in a stable and synchronized state. By incorporating CPUID, I was able to get consistent and accurate readings of the TSC, resolving the pipeline synchronization issues we were facing. Thanks for the help, guys!
Title: Re: rdtsc turbo mode
Post by: InfiniteLoop on November 24, 2023, 04:54:04 PM
RDTSC is only affected by base frequency overclocking and is usually the same as the base clock.
On Intel it's inside CPUID. AMD has no straightforward means of finding the frequency.
Title: Re: rdtsc turbo mode
Post by: v0xX on November 25, 2023, 08:54:48 PM
Quote from: InfiniteLoop on November 24, 2023, 04:54:04 PM
RDTSC is only affected by base frequency overclocking and is usually the same as the base clock.
On Intel it's inside CPUID. AMD has no straightforward means of finding the frequency.

On AMD you can read the frequency through SystemHypervisorSharedPageInformation:

static inline uint64_t get_rdtsc_freq(void) {
    // Cached result; 0 means "not determined yet" or "query failed"
    static uint64_t tsc_freq = 0;
    volatile uint64_t* hypervisor_shared_page = NULL;
    ULONG size = 0;

    if (tsc_freq != 0) {
        return tsc_freq;
    }

    // SystemHypervisorSharedPageInformation == 0xc5
    NTSTATUS result = ZwQuerySystemInformation(SystemHypervisorSharedPageInformation, (void*)&hypervisor_shared_page, sizeof(hypervisor_shared_page), &size);

    // success
    if (result >= 0 && size == sizeof(hypervisor_shared_page) && hypervisor_shared_page != NULL) {
        // docs say ReferenceTime = ((VirtualTsc * TscScale) >> 64)
        //      set ReferenceTime = 10000000 = 1 second @ 10MHz, solve for VirtualTsc
        //       =>    VirtualTsc = (10000000 << 64) / TscScale
        // 64-bit approximation: drop the low 32 bits of both operands
        uint64_t scale_hi = hypervisor_shared_page[1] >> 32;
        if (scale_hi != 0) {
            tsc_freq = (10000000ull << 32) / scale_hi;
        }
        // If your build configuration supports 128 bit arithmetic, do this:
        // tsc_freq = ((unsigned __int128)10000000ull << 64) / hypervisor_shared_page[1];
    }
    return tsc_freq; // 0 on failure
}
Title: Re: rdtsc turbo mode
Post by: HSE on November 25, 2023, 09:22:28 PM
As far as I understand, no CPU can read its own frequency.

What the CPU does is access the clock, but that depends on the frequency and on the clock.

So, at best, it's only possible to obtain an estimate between two accesses to the clock. And because that looks random, it must be measured several times.

No?
Title: Re: rdtsc turbo mode
Post by: LiaoMi on November 26, 2023, 05:24:58 AM
A simple library to sample the frequency on single CPU cores. The frequency is sampled by computing the ratio of actual performed cycles to the cycles that have passed in base frequency according to rdtsc.
https://github.com/intel/intel-cpu-frequency-library
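
The ratio described above can be written down in a few lines. This is a rough sketch of the arithmetic only, not code from that library: the function name effective_core_hz is hypothetical, and how the count of actually performed cycles is obtained (for example from a core-cycle performance counter) is left out and assumed to be available to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* actual_cycles: core cycles really executed during the interval (assumed input).
 * tsc_cycles:    TSC ticks elapsed over the same interval.
 * tsc_hz:        constant TSC rate in Hz.
 * Returns the average core frequency over the interval in Hz. */
static double effective_core_hz(uint64_t actual_cycles, uint64_t tsc_cycles, uint64_t tsc_hz)
{
    assert(tsc_cycles != 0);
    return (double)actual_cycles / (double)tsc_cycles * (double)tsc_hz;
}
```

A ratio above 1 means the core ran faster than the TSC rate during the interval (turbo), and a ratio below 1 means it was throttled or partly idle.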

New CPUs have a "constant timestamp counter frequency" feature. This means that the timer queried by the rdtsc instruction doesn't change its frequency when CPU cores are overclocked or downclocked by Turbo Boost. It also means that you cannot detect the current CPU frequency by comparing rdtsc progress to HPET progress.

Monitoring TurboBoost frequency tool
https://github.com/shimada-k/turbofreq

Acquiring high-resolution time stamps
https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps
Title: Re: rdtsc turbo mode
Post by: daydreamer on November 26, 2023, 06:19:47 PM
When running old games on my new laptop, I use an energy setting with Turbo disabled, because the old games are made to run as fast as possible.

Wouldn't an alternative be to measure with and without Turbo and compare?
Title: Re: rdtsc turbo mode
Post by: InfiniteLoop on November 26, 2023, 11:45:05 PM
Quote from: v0xX on November 25, 2023, 08:54:48 PM
On AMD you can read the frequency through SystemHypervisorSharedPageInformation

That code doesn't make sense. Hmm, let me adjust it.
No joy. It's zero.
//SystemHypervisorSharedPageInformation = 0xC5 = 197
NTSTATUS(*GetInfo)(SYSTEM_INFORMATION_CLASS, PVOID, ULONG, PULONG) = nullptr;

HMODULE hh = LoadLibraryA("Ntdll.dll");
if (hh != NULL) // LoadLibraryA returns NULL on failure, not INVALID_HANDLE_VALUE
{
GetInfo = (NTSTATUS(*)(SYSTEM_INFORMATION_CLASS, PVOID, ULONG, PULONG))GetProcAddress(hh, "ZwQuerySystemInformation");
}
else
std::cout << "bad handle \r\n";
if (GetInfo == nullptr)
{
std::cout << "bad address\r\n";
}
else
{
unsigned long long* info = nullptr;
unsigned int l = 0;
LARGE_INTEGER p;
NTSTATUS t;
t = GetInfo((SYSTEM_INFORMATION_CLASS)197, info, 0, (PULONG)&l);
if (t  > 0x7FFFFFFF)
std::cout << t << " error A\r\n";
else if (l > 0)
{
info = (unsigned long long*)_aligned_malloc(8 * l, 64);
for(int i = 0; i < l; i++)
info[i] = 0;
t = GetInfo((SYSTEM_INFORMATION_CLASS)197, info, l, nullptr);
if (t > 0x7FFFFFFF)
std::cout << t << " error B\r\n";
else
{
unsigned long long tsc = info[1];
std::cout << "TSC Frequ " << tsc << "\r\n";
}
_aligned_free(info);
}
}
FreeLibrary(hh); // hh is a module handle from LoadLibraryA, so it is released with FreeLibrary
Title: Re: rdtsc turbo mode
Post by: v0xX on December 30, 2023, 11:01:09 PM
After some research, here is a little fix for it :)
- Disable intelppm.sys.
- Disable performance boost mode.
- Enable SpeedStep/Turbo in the BIOS (intelppm does not use the SpeedStep turbo; only Speed Shift uses it).
- Then set Attributes to 2 on the key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Power\PowerSettings\54533251-82be-4824-96c1-47b60b740d00\be337238-0d82-4146-a960-4f3749d470c7.
With this, rdtsc counts cycles correctly with Turbo enabled, without any issue :skrewy: