rdtsc turbo mode

Started by v0xX, November 21, 2023, 05:17:08 PM

v0xX

Hello everyone,

I'm facing a technical challenge related to CPU cycle counting in game development. My current setup uses RDTSC to read the system's main clock, which is crucial for achieving precise timing in games. However, I'm encountering synchronization problems when Intel's Turbo Boost feature is active. Disabling Turbo Boost in the BIOS seems to resolve the issue, but it negatively impacts the base clock speed, which is not ideal for gaming performance.

To address this, I considered overclocking the BCLK to maintain high frame rates, but I'd prefer a solution that allows accurate RDTSC counting even with Turbo Boost on. Here's how I'm currently reading the CPU frequency, which is only effective with Turbo Boost off:

typedef struct _PROCESSOR_POWER_INFORMATION {
    ULONG Number;
    ULONG MaxMhz;
    ULONG CurrentMhz;
    ULONG MhzLimit;
    ULONG MaxIdleState;
    ULONG CurrentIdleState;
} PROCESSOR_POWER_INFORMATION, * PPROCESSOR_POWER_INFORMATION;

typedef struct _SYSTEM_BASIC_INFORMATION
{
    ULONG Reserved;
    ULONG TimerResolution;
    ULONG PageSize;
    ULONG NumberOfPhysicalPages;
    ULONG LowestPhysicalPageNumber;
    ULONG HighestPhysicalPageNumber;
    ULONG AllocationGranularity;
    ULONG_PTR MinimumUserModeAddress;
    ULONG_PTR MaximumUserModeAddress;
    KAFFINITY ActiveProcessorsAffinityMask;
    CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION, * PSYSTEM_BASIC_INFORMATION;

#define MHZ_IN_HZ       1000000
#define BTAG 'BCmN'

static uint64_t determine_cpu_cycle_rate(void)
{
    NTSTATUS status;
    ULONG num_proc;
    SYSTEM_BASIC_INFORMATION sysinfo;
    PPROCESSOR_POWER_INFORMATION proc_pwr_info;
    uint64_t cpu_cycle_freq;

    // Get number of logical processors
    status = ZwQuerySystemInformation(SystemBasicInformation, &sysinfo, sizeof(sysinfo), NULL);
    if (!NT_SUCCESS(status)) {
        return 0;
    }
    num_proc = sysinfo.NumberOfProcessors;

    // Allocate memory for processor power information
    proc_pwr_info = (PPROCESSOR_POWER_INFORMATION)ExAllocatePoolWithTag(NonPagedPool, num_proc * sizeof(PROCESSOR_POWER_INFORMATION), BTAG);
    if (proc_pwr_info == NULL) {
        return 0;
    }

    // Get CPU information
    status = ZwPowerInformation(ProcessorInformation, NULL, 0, proc_pwr_info, num_proc * sizeof(PROCESSOR_POWER_INFORMATION));
    if (!NT_SUCCESS(status)) {
        ExFreePoolWithTag(proc_pwr_info, BTAG);
        return 0;
    }

    // Compute the clock rate in Hz from the current frequency of processor 0.
    // Widen to 64 bits first: CurrentMhz * MHZ_IN_HZ overflows a 32-bit ULONG above about 4294 MHz.
    cpu_cycle_freq = (uint64_t)proc_pwr_info[0].CurrentMhz * MHZ_IN_HZ;

    // Free the allocated memory
    ExFreePoolWithTag(proc_pwr_info, BTAG);

    return cpu_cycle_freq;
}
And this is my MASM code for reading RDTSC, using lfence to handle out-of-order execution issues on Intel platforms:

.code

TSCGet proc
    lfence              ; let earlier instructions complete before the TSC is read
    rdtsc               ; edx:eax = time stamp counter
    shl rdx, 32
    or rax, rdx         ; combine both halves in rax, the return register
    ret
TSCGet endp

end

Now my question is: how can I accurately count CPU cycles using RDTSC in a gaming environment with Turbo Boost enabled? Is it possible? Thanks for the help!

jj2007

Hi v0xX,

You probably know that rdtsc is a can of worms. Microsoft discourages its use and recommends QueryPerformanceCounter instead, which I use in my NanoTimer macro. It is pretty reliable but counts micro- or milliseconds, not cycles.

Another macro, CyCt*, uses a statistical approach to achieve more precise cycle counts, i.e. it cuts off outliers. Here is a typical plot for three different algos:



To understand your problem better, could you tell us in a few words how exactly you are using your cycle count, and why precision is important for you?

I attach the plot program. Just press arrow right to refresh the timings, it's hilarious what Windows is doing behind the scenes ;-)

P.S.: Welcome to the forum :thup:

v0xX

Hi

The crux of my implementation is converting these cycles into time measured in seconds. I do this by dividing the cycle count by the CPU's frequency. This division yields two values: time_seconds (the whole seconds) and time_fraction (the fractional part of a second). To avoid precision loss due to very high cycle counts, I've implemented a mechanism where the timer wraps more frequently (every week), which keeps the fractional part accurate. This method is essential for my game, where timing accuracy down to microseconds is critical. That is how I use the rdtsc reading.
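
For readers following along, here is a minimal C sketch of the split described above. It is not the code from this project: the names tsc_time_t and cycles_to_time are hypothetical, and it assumes the fractional part is wanted in microseconds. Using integer division plus the remainder keeps the fraction exact to the microsecond without going through floating point.

```c
#include <stdint.h>

/* Hypothetical result type: whole seconds plus the fractional part
   expressed in microseconds (0..999999). */
typedef struct {
    uint64_t seconds;
    uint32_t microseconds;
} tsc_time_t;

/* Split a cycle count into seconds and microseconds, given the
   counter frequency in Hz. Returns zeros if the frequency is unknown. */
static tsc_time_t cycles_to_time(uint64_t cycles, uint64_t freq_hz)
{
    tsc_time_t t = {0, 0};
    if (freq_hz == 0)
        return t;

    uint64_t rem = cycles % freq_hz;   /* cycles inside the current second */
    t.seconds = cycles / freq_hz;

    /* rem < freq_hz, so rem * 1000000 stays well inside 64 bits
       for any realistic CPU frequency. */
    t.microseconds = (uint32_t)(rem * 1000000u / freq_hz);
    return t;
}
```

For example, 7,500,000,000 cycles at 3 GHz come out as 2 seconds and 500000 microseconds.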

Biterider

Hi v0xX
In your case, using the CPU RDTSC has some disadvantages. A better way is to use the high frequency interrupt counter. Some APIs use it, for example GetSystemTimePreciseAsFileTime.
Check this https://devblogs.microsoft.com/oldnewthing/20170921-00/?p=97057.

Another possibility is to read the shared kernel memory at 7FFE0000h, which holds the tick count with an accuracy of about 100ns.

Biterider

jj2007

Quote from: Biterider on November 22, 2023, 04:14:55 AM
Another possibility is to read the shared kernel memory at 7FFE0000h, which holds the tick count with an accuracy of about 100ns.

Where can I find that variable? I made a quick test, after reading Geoff Chappell's page, and it looks like there are the usual GetTickCount resolutions:

include \masm32\MasmBasic\MasmBasic.inc
  Init
  mov esi, 7FFE0000h
  PrintLine HexDump$(esi, 64)
  xor ecx, ecx
  .Repeat
    PrintLine Hex$([esi+8]), " ", Hex$([esi+14h]), " ", Hex$(rv(GetTickCount))
    push 999999*2  ; tiny delay
    .Repeat
        dec stack
    .Until Sign?
    pop edx
    inc ecx
  .Until ecx>30
EndOfCode

Output:
7FFE0000  00 00 00 00 00 00 A0 0F 51 FE 53 51 18 05 00 00 .......QSQ....
7FFE0010  18 05 00 00 89 23 B9 EC B5 1C DA 01 B5 1C DA 01 ....#....
7FFE0020  00 98 3B 9E F7 FF FF FF F7 FF FF FF 64 86 64 86 .;dd
7FFE0030  43 00 3A 00 5C 00 57 00 69 00 6E 00 64 00 6F 00 C.:.\.W.i.n.d.o.

5153FE51 ECB92389 2163F9EB
51570548 ECBC2A92 2163F9FB
51570548 ECBC2A92 2163F9FB
5158F1F9 ECBE174E 2163FA0B
5159B54C ECBEDAA5 2163FA1A
515A5186 ECBF76E3 2163FA1A
515B99D3 ECC0BF37 2163FA1A
515B99D3 ECC0BF37 2163FA1A
515B99D3 ECC0BF37 2163FA1A
515E338C ECC358FF 2163FA2A
515ECFC6 ECC3F53D 2163FA3A
515F9322 ECC4B89D 2163FA3A
51607D91 ECC5A311 2163FA3A
5160D5E2 ECC5FB64 2163FA3A
5160D5E2 ECC5FB64 2163FA3A
5162A0F5 ECC7C682 2163FA49
5162A0F5 ECC7C682 2163FA49
5162A0F5 ECC7C682 2163FA49
5165034B ECCA28E6 2163FA59
51664D5B ECCB72FD 2163FA68
51664D5B ECCB72FD 2163FA68
51664D5B ECCB72FD 2163FA68
5168775E ECCD9D0C 2163FA78
5168775E ECCD9D0C 2163FA78
5168775E ECCD9D0C 2163FA78
516AF522 ECD01ADF 2163FA88
516AF522 ECD01ADF 2163FA88
516AF522 ECD01ADF 2163FA88
516D4B43 ECD2710D 2163FA97
516E0E7A ECD33449 2163FA97
516EAACB ECD3D09D 2163FA97

As you can see, the two counters at +8 and +14h are not exactly synchronised with each other or with GetTickCount's low dword, but despite the tiny delay, they don't get updated for two or three loop iterations. With 999999*1, you get 5 to 10 identical readings. It's the standard 16ms resolution of GetTickCount, unfortunately :sad:

v0xX

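You can read it from MASM. This one is a 64-bit GetTickCount that works in both kernel and user mode:


.code
MyGetTickCount64Kernel32 proc
                mov     ecx, dword ptr [7FFE0004h]  ; KUSER_SHARED_DATA.TickCountMultiplier
                mov     eax, 7FFE0320h              ; address of KUSER_SHARED_DATA.TickCount
                shl     rcx, 20h                    ; multiplier << 32
                mov     rax, qword ptr [rax]        ; 64-bit tick count
                shl     rax, 8                      ; tick count << 8
                mul     rcx                         ; rdx:rax = 128-bit product
                mov     rax, rdx                    ; high qword = (ticks * multiplier) >> 24 = milliseconds
                ret
MyGetTickCount64Kernel32 endp
end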
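
The arithmetic in that routine can be checked in isolation. The C sketch below is only an illustration: the function name ticks_to_ms is made up, and the two shared-page values are passed in as parameters rather than read from 7FFE0004h and 7FFE0320h. The hex dump earlier in the thread shows 0FA00000h at offset +4; divided by 2^24 that is 15.625, i.e. milliseconds per tick, which fits the roughly 16 ms steps seen in the output.

```c
#include <stdint.h>

/* Same computation as the MASM routine:
   ((tick_count << 8) * (multiplier << 32)) >> 64
   which equals (tick_count * multiplier) >> 24.
   The tick count is split into 32-bit halves so that no intermediate
   product needs more than 64 bits. Like the assembly version, this is
   valid for tick counts below 2^56. */
static uint64_t ticks_to_ms(uint64_t tick_count, uint32_t multiplier)
{
    uint64_t hi = tick_count >> 32;
    uint64_t lo = tick_count & 0xFFFFFFFFu;
    return ((hi * multiplier) << 8) + ((lo * multiplier) >> 24);
}
```

With the multiplier from the dump, 64 ticks convert to exactly 1000 ms.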

jj2007

Sure, but it's the same pattern:
  xor ebx, ebx
@@:
  PrintLine Hex$(rv(MyGetTickCount64Kernel32)), Tb$, Hex$(rv(GetTickCount))
  xor ecx, ecx
x1:    inc ecx
    cmp ecx, 999999
    jb x1
  inc ebx
  cmp ebx, 20
  jb @B

Output:
2404592 21804592
2404592 21804592
2404592 21804592
2404592 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045a2
24045a2 218045b2
24045b2 218045b2
24045b2 218045b2
24045b2 218045b2
24045b2 218045b2
24045b2 218045b2

Standard GetTickCount resolution, unusable for your project. Try QueryPerformanceCounter: Micros*t documents it as a high-resolution time stamp (better than 1 microsecond), no matter what tricks your CPU is playing on you (throttling, turbo, ...).

v0xX

Quote from: jj2007 on November 22, 2023, 07:35:57 AM
Standard GetTickCount resolution, unusable for your project. Try QueryPerformanceCounter, Micros*t says it returns nanoseconds, no matter what tricks your CPU is playing on you (throttling, turbo, ...).

I've been using ZwQueryPerformanceCounter, which functions similarly to QueryPerformanceCounter. However, I'm exploring the rdtsc instruction for its superior precision.

My main challenge is dealing with the CPU's Turbo Boost feature, which seems to affect the accuracy of rdtsc. I suspect that reading the base frequency directly might bypass the Turbo Boost fluctuations. Here's the approach I'm considering:

uint64_t tsc_freq_hz(void) {
    // MSR 0xCE (MSR_PLATFORM_INFO): bits 15:8 hold the maximum non-turbo ratio
    msr_t msr = rdmsr(0xce);
    // Widen to 64 bits so the result in Hz cannot overflow above about 4.29 GHz
    return (uint64_t)BASE_CLOCK_MHZ * ((msr.lo >> 8) & 0xff) * 1000000;
}

With this code, I aim to calculate the TSC frequency from the MSR (Model Specific Register), but I'm still not sure whether the result will fluctuate.

Biterider

Hi v0xX
Since Nehalem, the CPUs support the invariant TSC

From Intel® 64 and IA-32 Architectures Software Developer's Manual

Quote
18.17.1 Invariant TSC
The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
and
Quote
18.17.4 Invariant Time-Keeping
The invariant TSC is based on the invariant timekeeping hardware (called Always Running Timer or ART), that runs at the core crystal clock frequency. The ratio defined by CPUID leaf 15H expresses the frequency relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity relationship holds between TSC and the ART hardware:
TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
where 'K' is an offset that can be adjusted by a privileged agent. When ART hardware is reset, both invariant TSC and K are also reset.
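
As a rough illustration of how those CPUID fields could be used, here is a small C sketch. The function names are hypothetical, and the register values are passed in as plain integers so the logic stays portable; on a real system they would come from a CPUID intrinsic (for example __cpuid from intrin.h with MSVC). It also relies on the CPUID leaf 15H description in the same manual, where ECX reports the core crystal clock frequency in Hz and may be 0 when the processor does not enumerate it.

```c
#include <stdint.h>

/* CPUID.80000007H:EDX bit 8 indicates invariant TSC support. */
static int has_invariant_tsc(uint32_t leaf80000007_edx)
{
    return (int)((leaf80000007_edx >> 8) & 1u);
}

/* CPUID.15H: EAX = denominator and EBX = numerator of the TSC to
   crystal clock ratio, ECX = crystal clock frequency in Hz.
   Returns the TSC frequency in Hz, or 0 if any field is not enumerated. */
static uint64_t tsc_hz_from_leaf15(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    if (eax == 0 || ebx == 0 || ecx == 0)
        return 0;
    return (uint64_t)ecx * ebx / eax;
}
```

For instance, a 24 MHz crystal with a ratio of 188/2 gives 2,256,000,000 Hz. When the function returns 0, the frequency has to be obtained some other way, for example by calibrating against QueryPerformanceCounter.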

Biterider

jj2007

Quote from: v0xX on November 21, 2023, 05:17:08 PM
How can I accurately count CPU cycles using RDTSC in a gaming environment with Turbo Boost enabled? Is it possible?

I'm not a gamer, so pardon me if what I write looks silly:

- are you interested in cycle counts, irrespectively of cpu speed (turbo boost, throttling)? In that case, rdtsc preceded by cpuid would be the way to go (beware of core switches, beware of interrupts);

- or do you want time, i.e. nanoseconds between two events? Then QueryPerformance* is probably the best way to go (no idea how it treats interrupts, though).

In any case, you urgently need a testbed.

v0xX

Quote from: jj2007 on November 22, 2023, 09:18:29 AM
- are you interested in cycle counts, irrespectively of cpu speed (turbo boost, throttling)? In that case, rdtsc preceded by cpuid would be the way to go (beware of core switches, beware of interrupts);

Thank you for your suggestions and insights. My focus is indeed on measuring cycle counts, and for this I'm using the rdtsc instruction preceded by cpuid to ensure accuracy in my measurements. I am also aware of the potential issues related to core switches and interrupts, and I'll keep these in mind as I proceed. Thanks!

jj2007

Quote from: v0xX on November 22, 2023, 09:37:24 AM
My focus is indeed on measuring cycle counts, and for this, I'm using the rdtsc instruction preceded by cpuid

Ok. In that case, consider a "statistical" approach, i.e. take a hundred readings and eliminate the slowest 10% - see the image above in reply #1. Otherwise you will never get consistent results from rdtsc, mainly due to context switches.
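
A minimal C sketch of that idea, for illustration only: it is not the CyCt macro, and the name trimmed_mean is made up. It sorts the readings, discards the slowest 10% and averages what is left.

```c
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for unsigned 64-bit cycle counts */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place, drops the slowest 10 % (the largest
   cycle counts) and returns the integer average of the remainder. */
static uint64_t trimmed_mean(uint64_t *samples, size_t n)
{
    if (n == 0)
        return 0;

    qsort(samples, n, sizeof samples[0], cmp_u64);

    size_t keep = n - n / 10;          /* always at least 1 when n >= 1 */
    uint64_t sum = 0;
    for (size_t i = 0; i < keep; i++)
        sum += samples[i];
    return sum / keep;
}
```

With ten readings of about 100 cycles and a single 5000-cycle outlier caused by a context switch, the outlier is discarded and the result is 100.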

v0xX

I figured out that it has nothing to do with Turbo Boost. I've been working on the issue of pipeline unsynchronization when reading the Time Stamp Counter (TSC) on our system. Initially, I tried two different methods, but both resulted in inconsistent cycle counts.

First method - TSCGet with a single LFENCE:

TSCGet proc
    lfence
    rdtsc
    shl rdx, 32
    or rax, rdx
    ret
TSCGet endp

In this approach, I used LFENCE (Load Fence) followed by RDTSC (Read Time Stamp Counter). LFENCE was intended to act as a barrier for earlier reads, but it didn't provide full serialization, leading to some inconsistencies in the TSC reading due to out-of-order execution.

Second method - TSCGet with double LFENCE:

TSCGet proc
    lfence
    rdtsc
    lfence
    shl rdx, 32
    or rax, rdx
    ret
TSCGet endp
Here, I added an additional LFENCE after RDTSC. While this helped in mitigating some out-of-order execution issues, it still wasn't completely reliable for our needs. Finally, I developed a more robust solution that effectively fixed the pipeline synchronization issue:

align 16
TSCGet proc
    push rbx
    sub eax, eax
    cpuid
    rdtsc
    shl rdx, 32
    or rax, rdx
    push rax
    sub eax, eax
    cpuid
    pop rax
    pop rbx
    ret
TSCGet endp
end
This method involves using the CPUID instruction before and after RDTSC. CPUID is a fully serializing instruction: every instruction preceding it must complete and the instruction pipeline is flushed before execution continues. This guarantees that RDTSC executes in a stable and synchronized state. By incorporating CPUID, I was able to get consistent and accurate readings of the TSC, resolving the pipeline synchronization issues we were facing. Thanks for the help, guys!

InfiniteLoop

The RDTSC rate is only affected by base frequency overclocking and is usually the same as the base clock.
On Intel it's available through CPUID. AMD has no straightforward means of finding the frequency.

v0xX

Quote from: InfiniteLoop on November 24, 2023, 04:54:04 PM
RDTSC is only affected by base frequency overclocking and is usually the same as the base clock.
On Intel its inside CPUID. AMD has no straight-forward means of finding the frequency.

On AMD you can read the frequency through SystemHypervisorSharedPageInformation:

static inline uint64_t get_rdtsc_freq(void) {
    static uint64_t tsc_freq = 0;
    volatile uint64_t* hypervisor_shared_page = NULL;
    unsigned int size = 0;

    // Reuse the cached value after the first successful query
    if (tsc_freq != 0) {
        return tsc_freq;
    }

    // SystemHypervisorSharedPageInformation == 0xc5
    int result = ZwQuerySystemInformation(SystemHypervisorSharedPageInformation, (void*)&hypervisor_shared_page, sizeof(hypervisor_shared_page), (PULONG)&size);

    // success (the extra check avoids a division by zero on a bogus TscScale)
    if (size == sizeof(hypervisor_shared_page) && result >= 0 && (hypervisor_shared_page[1] >> 32) != 0) {
        // docs say ReferenceTime = ((VirtualTsc * TscScale) >> 64)
        //      set ReferenceTime = 10000000 = 1 second @ 10MHz, solve for VirtualTsc
        //       =>    VirtualTsc = (10000000 << 64) / TscScale
        // 64-bit approximation: shift numerator and denominator right by 32 bits
        tsc_freq = (10000000ull << 32) / (hypervisor_shared_page[1] >> 32);
        // If your build configuration supports 128 bit arithmetic, do this:
        // tsc_freq = ((unsigned __int128)10000000ull << 64) / hypervisor_shared_page[1];
    }

    return tsc_freq;  // 0 if the query failed
}