Author Topic: Saarland Informatics Campus - LATENCY, THROUGHPUT AND PORT USAGE INFORMATION  (Read 486 times)

LiaoMi

  • Member
  • ****
  • Posts: 925
uops.info - Saarland Informatics Campus

LATENCY, THROUGHPUT, AND PORT USAGE INFORMATION
FOR INSTRUCTIONS ON RECENT X86 MICROARCHITECTURES

This website provides more than 500,000 pages with detailed latency, throughput, and port usage data for most instructions on many recent x86 microarchitectures. While such data is important for understanding, predicting, and optimizing the performance of software running on these microarchitectures, most of it is not documented in the official processor manuals.

Table - https://uops.info/table.html
We provide an interactive HTML table with latency, throughput, and port usage data for all tested microarchitectures, both as measured on the actual hardware, and as obtained from running our microbenchmarks on top of Intel IACA.

Each value in the table comes with a link to a web page with details on the microbenchmarks that were used to obtain this value. For the latency, this page contains separate data for each pair of input and output operands. In total, the table contains links to more than 400,000 pages.

XML File - https://uops.info/xml.html
The results of our microbenchmarks are available in a machine-readable XML file. The file contains the results for all tested microarchitectures, both as measured on the actual hardware, and as obtained from running our microbenchmarks on top of Intel IACA.

Furthermore, the file also contains detailed information on the operands of each instruction, which was obtained from the configuration files of Intel's X86 Encoder Decoder (XED) library. We provide a brief Python script that shows how this information can be used to automatically generate assembler code for all x86 instructions.

https://uops.info/instructions.xml - (March 2021, current version)
https://raw.githubusercontent.com/andreas-abel/XED-to-XML/master/xmlToAssembler.py
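
As a rough illustration of what "machine-readable" means here, the sketch below pulls per-operand latencies out of a small inline XML excerpt using only Python's standard library. It is not the authors' xmlToAssembler.py script. The excerpt is hand-written for this example, and the element and attribute names (instruction, architecture, latency, string, name, start_op, target_op, cycles) are assumptions about the layout of instructions.xml that should be checked against the real file. The helper name latencies is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written excerpt for illustration only. The element and attribute
# names are ASSUMED to mirror instructions.xml; verify against the real file.
SAMPLE = """
<root>
  <extension name="AVX">
    <instruction asm="VADDSUBPS" string="VADDSUBPS (YMM, YMM, M256)">
      <architecture name="HSW">
        <measurement>
          <latency start_op="2" target_op="1" cycles="3"/>
        </measurement>
      </architecture>
    </instruction>
  </extension>
</root>
"""


def latencies(root, instr_string, arch):
    """Return {(start_op, target_op): cycles} for one instruction variant
    on one microarchitecture. Hypothetical helper, not part of uops.info."""
    result = {}
    for instr in root.iter("instruction"):
        if instr.get("string") != instr_string:
            continue
        for arch_node in instr.iter("architecture"):
            if arch_node.get("name") != arch:
                continue
            for lat in arch_node.iter("latency"):
                cycles = lat.get("cycles")
                # Skip entries that carry no plain cycle count.
                if cycles is not None:
                    key = (int(lat.get("start_op")), int(lat.get("target_op")))
                    result[key] = int(cycles)
    return result


root = ET.fromstring(SAMPLE)
print(latencies(root, "VADDSUBPS (YMM, YMM, M256)", "HSW"))
```

To try this against the downloaded file, ET.parse("instructions.xml").getroot() could replace ET.fromstring(SAMPLE), provided the assumed names match.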

Caches
On this page, we provide details on the caches of some of the processors that we analyzed. In particular, we provide information on the cache replacement policies, which are undocumented in the official manuals.

The results were obtained using the nanoBench Cache Analyzer, which is available on GitHub. The repository also contains a simulator for all the policies described on this page. Further information on the policies can be found in our paper "nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems" and in Andreas Abel's PhD thesis "Automatic Generation of Models of Microarchitectures".

We provide results for the following processors:

Core i5-1035G1 (Ice Lake)
Core i3-8121U (Cannon Lake)
Core i7-8700K (Coffee Lake)
Core i7-7700 (Kaby Lake)
Core i7-6500U (Skylake)
Core i5-5200U (Broadwell)
Xeon E3-1225 v3 (Haswell)
Core i5-3470 (Ivy Bridge)
Core i7-2600 (Sandy Bridge)
Core i5-650 (Westmere)
Core i5-750 (Nehalem)
Core 2 Duo E8400 (Wolfdale)
Core 2 Duo E6750 (Conroe)

At https://uops.info/cache/lat_combined.html, we provide a set of graphs for all CPUs showing the access latencies for memory areas of different sizes.

AsmGrid - https://asmjit.com/asmgrid/
Latency and throughput data for several recent x86 CPUs. The data was obtained using the CULT project.

Example -
https://uops.info/html-lat/HSW/VADDSUBPS_YMM_YMM_M256-Measurements.html

Code:
VADDSUBPS (YMM, YMM, M256) - Latency
Operands
Operand 1 (w): Register (YMM0, YMM1, YMM2, YMM3, YMM4, YMM5, YMM6, YMM7, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13, YMM14, YMM15)
Operand 2 (r): Register (YMM0, YMM1, YMM2, YMM3, YMM4, YMM5, YMM6, YMM7, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13, YMM14, YMM15)
Operand 3 (r): Memory
Latency operand 2 → 1: 3
Latency operand 3 → 1 (address, base register): ≤10
Latency operand 3 → 1 (address, index register): ≤10
Latency operand 3 → 1 (memory): ≤9
Latency operand 2 → 1: 3
Experiment 1
Instruction: VADDSUBPS YMM1, YMM0, ymmword ptr [R14]
Chain instructions: VSHUFPD YMM0, YMM1, YMM1, 0;VSHUFPD YMM0, YMM0, YMM0, 0;VSHUFPD YMM0, YMM0, YMM0, 0;VSHUFPD YMM0, YMM0, YMM0, 0;VSHUFPD YMM0, YMM0, YMM0, 0
Chain latency: 5
Code:
   0: c4 c1 7f d0 0e        vaddsubps ymm1,ymm0,YMMWORD PTR [r14]
   5: c5 f5 c6 c1 00        vshufpd ymm0,ymm1,ymm1,0x0
   a: c5 fd c6 c0 00        vshufpd ymm0,ymm0,ymm0,0x0
   f: c5 fd c6 c0 00        vshufpd ymm0,ymm0,ymm0,0x0
  14: c5 fd c6 c0 00        vshufpd ymm0,ymm0,ymm0,0x0
  19: c5 fd c6 c0 00        vshufpd ymm0,ymm0,ymm0,0x0
Init:
MOV R15, 10000;
L: VADDPS YMM0, YMM1, YMM1;
VADDPS YMM0, YMM1, YMM1;
DEC R15;
JNZ L;
VZEROALL;
MOV RAX, 0x4000000040000000;
MOV [R14+0], RAX;
MOV [R14+8], RAX;
MOV [R14+16], RAX;
MOV [R14+24], RAX;
VMOVUPD YMM0, [R14]
Results:
Instructions retired: 6.0
Core cycles: 8.0
Reference cycles: 8.05
UOPS_EXECUTED.CORE: 7.0
Experiment 2 (source registers initialized by an instruction of the same kind)
Instruction: VADDSUBPS YMM1, YMM0, ymmword ptr [R14]
Chain instructions: VSHUFPD YMM0, YMM1, YMM1, 0;VSHUFPD YMM0, YMM0, YMM0, 0;VSHUFPD YMM0, YMM0, YMM0, 0;VSHUFPD YMM0, YMM0, YMM0, 0;VSHUFPD YMM0, YMM0, YMM0, 0
Chain latency: 5
Code:
   0: c4 c1 7f d0 0e        vaddsubps ymm1,ymm0,YMMWORD PTR [r14]
   5: c5 f5 c6 c1 00        vshufpd ymm0,ymm1,ymm1,0x0
   a: c5 fd c6 c0 00        vshufpd ymm0,ymm0,ymm0,0x0
   f: c5 fd c6 c0 00        vshufpd ymm0,ymm0,ymm0,0x0
  14: c5 fd c6 c0 00        vshufpd ymm0,ymm0,ymm0,0x0
  19: c5 fd c6 c0 00        vshufpd ymm0,ymm0,ymm0,0x0
Init:
MOV R15, 10000;
L: VADDPS YMM0, YMM1, YMM1;
VADDPS YMM0, YMM1, YMM1;
DEC R15;
JNZ L;
VZEROALL;
MOV RAX, 0x4000000040000000;
MOV [R14+0], RAX;
MOV [R14+8], RAX;
MOV [R14+16], RAX;
MOV [R14+24], RAX;
VMOVUPD YMM0, [R14];
VADDSUBPS YMM0, YMM1, ymmword ptr [R14]
Results:
Instructions retired: 6.0
Core cycles: 8.0
Reference cycles: 8.0
UOPS_EXECUTED.CORE: 7.0
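
One way to read the two experiments above: the benchmarked code forms a dependency loop through VADDSUBPS and the five-instruction VSHUFPD chain, so the measured core cycles per iteration cover both. Subtracting the stated chain latency from the core cycles gives the latency of the instruction itself, which matches the reported "Latency operand 2 → 1: 3". The short sketch below redoes that arithmetic with the numbers from the listing. The function name is hypothetical, and the subtraction is my interpretation of the listing rather than something the page states.

```python
def instruction_latency(core_cycles, chain_latency):
    """Latency attributed to the instruction under test, ASSUMING the
    measured cycles are the sum of its latency and the chain latency."""
    return core_cycles - chain_latency


# (core cycles, chain latency) copied from the two experiments in the listing.
experiments = {
    "Experiment 1": (8.0, 5),
    "Experiment 2": (8.0, 5),
}

for name, (cycles, chain) in experiments.items():
    print(name, instruction_latency(cycles, chain))
```

Both experiments come out at 3.0 cycles, in line with the operand 2 → 1 latency listed at the top of the page.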

daydreamer

  • Member
  • *****
  • Posts: 1750
  • building nextdoor
Thanks LiaoMi  :thumbsup: