Hi all!
Following Gabriele Paoloni, 2010: How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures. (http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf (http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf)) I adapted to UEFI the measurement core:
mov r10, pBootServices
invoke [r10].EFI_BOOT_SERVICES.RaiseTPL,TPL_HIGH_LEVEL
mov Old_TPL, rax
CPUID
RDTSC
mov cycles_high , rdx
mov cycles_low , rax
; invoke measured_loop, j
RDTSCP
mov cycles_high1 , rdx
mov cycles_low1 , rax
CPUID
mov r10, pBootServices
invoke [r10].EFI_BOOT_SERVICES.RestoreTPL, Old_TPL
One Paoloni trick is to call (or not) a loop function then you can make estimations from minimum values using lineal regression:
measured_loop proc loops:qword
local i : qword
ForLp i, 0, loops
function_under_glass1
Next i
ret
measured_loop endp
The target code for the test is that used by Gunther in Count cycles: test reports needed (http://masm32.com/board/index.php?topic=10087.0):function_under_glass1 macro
fld qword ptr [numerator]
fdiv qword ptr [denominator]
fdiv qword ptr [denominator]
fdiv qword ptr [denominator]
fdiv qword ptr [denominator]
fstp qword ptr [result]
endm
To run the test, PC booted from a FAT32 formated USB. *.efi files must be renamed to BOOTX64.efi and copied to /efi/boot/ directory in USB.
For first step the program is assembled with invocation to measured_loop commented (BenchmarkUEFI0.efi):
total number of spurious min values = 240
total variance = 0.5871
absolute max deviation = 8.000
variance of variances = 0.0002
lineal equation: total cycles = 30.8239 - 0.001 * repetitions
For second step the program is assembled making the invocation to measured_loop (BenchmarkUEFI2.efi):
total number of spurious min values = 1
total variance = 187.5673
absolute max deviation = 82.000
variance of variances = 25560.41
lineal equation: total cycles = 75.6068 + 18.0053 * repetitions
From these equations and rounding values you expect in this machine:
31 cycles overhead of measurement
45 cycles overhead of calling the loop function
18 cycles target code execution
----------- ------------------------------------------
94 cycles total measurement of single execution
Regards, HSE.
Hi HSE
That's a really interesting idea. :thumbsup:
I'm sure that replacing openSUSE with UEFI is a good thing, with potentially even better results.
I'm still trying to convince VirtualBox to play along, but as soon as it works I'll check.
Biterider
Hi Biterider
Quote from: Biterider on May 30, 2022, 01:51:14 AM
I'm sure that replacing openSUSE with UEFI is a good thing, with potentially even better results.
My idea was to test applications running from UEFI, then is the only option :biggrin:
Quote from: Biterider on May 30, 2022, 01:51:14 AM
I'm still trying to convince VirtualBox to play along, but as soon as it works I'll check.
:thumbsup:
This kind of timing in VirtualBox from USB it's a disaster. Using the clock for long processes look good.
To obtain reliable results this programs
must run rebooting the PC from the USB.
HSE
Hi HSE
I'm trying to build your Benchmark UEFI application, but the PrintC macro seems to be missing.
Regards, Biterider
Hi Biterider!!
PrintC is in efiutil.inc (https://github.com/ASMHSE/ObjAsm-C.1uefi/blob/f5f39ebb55a1c1ebf92feeab7b7b9d52e51ff7c0/Code/Inc/UEFI/efiUtil.inc)
Regards, HSE
Hi
I've made some progress on this topic and discovered some very useful things.
The Paoloni approach is pure gold (if done right). HSE explained it very well in the first post. Reading Paoloni's paper is highly recommended.
I created some tests with known results to track down some bugs and after fixing them I switched back to code timing. Comparing the result with the UEFI approach, I have a very close match with Agner Fog timing tables for my CPU model (+/-0.2 cycles max).
In order to debug the code with my usual debugger, I adapted the code to run on Windows. To my surprise, although the variance has increased noticeably, the results are also pretty good. This is really good news. This means that for non-ultra precise measurements (+/-1 cycle), this implementation can also be used on Windows. :thumbsup:
I'll refine the code and post it in the next days. At the moment it outputs the results to the DebugCenter.
Biterider
Congratulations :thumbsup:
Quote from: HSE on May 30, 2022, 12:42:59 AM mov r10, pBootServices
invoke [r10].EFI_BOOT_SERVICES.RaiseTPL,TPL_HIGH_LEVEL
mov Old_TPL, rax
CPUID
RDTSC
mov cycles_high , rdx
mov cycles_low , rax
That looks somewhat similar to my CyCt* macros:
push ebx
cpuid ; serialize; trashes ecx+edi
pop ebx
rdtsc
mov CyCtrdtsc, eax ; these cost a few cycles
mov CyCtrdtsc[4], edx ; that we will subtract later
I doubt, however, that you'll get stable timings from that - it's more complicated.
Btw...
mov cycles_high , edx
mov cycles_low , eax
...would be more consistent. The upper DWORDs aren't affected by rdtsc.
Hi HSE
Quote from: HSE on July 01, 2022, 06:32:48 AM
Congratulations :thumbsup:
You deserve it. You found it and it was your idea to combine UEFI & Paoloni. :cool:
Biterider
Hi JJ!
CyCt macros differ from Paoloni code after target code, wich is:
; invoke measured_loop, j
RDTSCP
mov cycles_high1 , rdx
mov cycles_low1 , rax
CPUID
Quote from: jj2007 on July 01, 2022, 06:40:12 AM
The upper DWORDs aren't affected by rdtsc.
In 64 bits, upper DWORD is cleared.
I think that kind of address in CyCt is not going to work with largeaddressaware.
Hi Biterider!
Quote from: Biterider on July 01, 2022, 06:52:07 AM
You deserve it.
I failed into find differences between my code and what you posted in repository, and I can't maked that work :biggrin:
Quote from: HSE on July 01, 2022, 07:17:06 AMIn 64 bits, upper DWORD is cleared.
Yes, that's why you should ideally use
RDTSCP
shl rdx, 32
or rax, rdx
mov cycles, rax ; a QWORD
CPUID
Now one question is "why cpuid after rdtsc?" I've seen this also in a SOF answer:
QuoteI suggest you read this whitepaper from intel if you are considering using this for benchmarking etc: download.intel.com/embedded/software/IA/324264.pdf (it actually shows that you need both rdtsc + cpuid and rdtscp + cpuid for correct measurements) –
Necrolis
Sep 28, 2012 at 7:25
But it doesn't make sense to me: why use the serialising instruction
after the measurement?
Btw this is not to criticise your code, which is very good - I am trying to find the
one correct solution for my 32-bit CyCt* macros. Even at SOF they are struggling (https://stackoverflow.com/questions/12631856/difference-between-rdtscp-rdtsc-memory-and-cpuid-rdtsc) :cool:
Quote from: jj2007 on July 01, 2022, 07:36:34 AM
But it doesn't make sense to me: why use the serialising instruction after the measurement?
Because in the end your are also serialising RDTSCP. Still can happen that RDTSCP is executed after "mov cycles_high1 , rdx; mov cycles_low1 , rax" and you obtain an spurius value, but probability is lower. That was Paoloni work, less spurious values were obtained in that way.
Quote from: jj2007 on July 01, 2022, 07:36:34 AM
RDTSCP
shl rdx, 32
or rax, rdx
mov cycles, rax ; a QWORD
CPUID
That is one more instruction inside the critical code (I called that
The glass). (for magnification glass, you know :biggrin:)
Quote from: HSE on July 01, 2022, 07:49:50 AMThat is one more instruction inside the critical code
But one less memory move...
Quote from: jj2007 on July 01, 2022, 08:58:41 AM
But one less memory move...
This is technology, not science. It's not much about logic than about measured deviations. You can test if that result in less deviations, perhaps you are wright :thumbsup: