RDTSC

Started by Jokaste, November 08, 2017, 08:43:45 AM


Jokaste

Hello everyone!

I would like to run a test of RDTSC.
I created a small function which only resets RCX.
I tested a non-aligned version of the function and a version aligned on a 16 byte boundary.
The results are very strange: sometimes the aligned version is quicker and other times it is slower!
Do you have an explanation, or a better way to run the tests?
Thanks

Quote

RDTSC
XOR RCX,RCX
--------------------------------------------------------------------
---------------> NOT ALIGNED

RDX:RAX    = 00002E8F13E974B7   00002EDF94E3C13A   00002EE3DFE7A9B2   00002EE70A0A7F63   00002EEAF8FCC41A
R9:R8      = 00002E8E8B356408   00002EDECC1D4E3F   00002EE37163E41F   00002EE704692472   00002EEA7BF407D8
RAX - R8   = 2 293 502 127      3 368 448 763      1 854 129 555      94 460 657         2 097 724 482
LOOP = 10 000 000


mov     eax, 10000000           ; loop count
rdtsc                           ; start count in EDX:EAX
mov     r8, rax
mov     r9, rdx
@Loop:
xor     rcx, rcx
dec     eax
jnz     SHORT @Loop
rdtsc                           ; end count in EDX:EAX
shl     rdx, 32
or      rax, rdx                ; RAX = 64-bit end count
shl     r9, 32
or      r8, r9                  ; R8  = 64-bit start count
nop
---------------> 16 BYTES ALIGNED
RDX:RAX    = 00002E80EB676582   00002EC9E986FD8F   00002ECE11AE236D   00002ED2778F7E13   00002ED60C1B7824
R9:R8      = 00002E8076A8B087   00002EC8F57DF4B3   00002ECE0779193E   00002ED1BC1E10CE   00002ED58727E171
RAX - R8   = 1 958 655 227      4 094 232 796      171 248 175        3 144 772 933      2 230 556 339
LOOP = 10 000 000
--------------------------------------------------------------------
mov     eax, 10000000           ; loop count
rdtsc                           ; start count in EDX:EAX
mov     r8, rax
mov     r9, rdx
jmp     SHORT @Loop             ; skip over the alignment padding
ALIGN 16
@Loop:
xor     rcx, rcx
dec     eax
jnz     SHORT @Loop
rdtsc                           ; end count in EDX:EAX
shl     rdx, 32
or      rax, rdx                ; RAX = 64-bit end count
shl     r9, 32
or      r8, r9                  ; R8  = 64-bit start count
nop
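
A note on the snippets above: RDTSC itself writes EDX:EAX, so the count loaded into EAX before the first read is overwritten before the loop even starts, which by itself can produce wildly varying differences. A common way to get steadier numbers is to keep the counter out of EAX/EDX, serialize around the reads, repeat the whole measurement several times and keep the smallest result. Below is a minimal sketch along those lines, assuming the CPU and assembler support RDTSCP (otherwise a second CPUID/RDTSC pair can stand in for it); the ten-million count is just a placeholder.

    xor     eax, eax
    cpuid                       ; serialize: earlier instructions retire first
                                ; (CPUID overwrites RAX, RBX, RCX and RDX;
                                ;  RBX is callee-saved in the x64 ABI)
    rdtsc                       ; EDX:EAX = start of time-stamp counter
    shl     rdx, 32
    or      rax, rdx
    mov     r8, rax             ; R8 = 64-bit start count

    mov     r10d, 10000000      ; loop counter, kept out of EAX/EDX and RCX
@@: xor     rcx, rcx            ; the work being measured, as in the thread's test
    dec     r10d
    jnz     SHORT @B

    rdtscp                      ; second read, waits for the loop to finish
    shl     rdx, 32             ; (RDTSCP also overwrites ECX)
    or      rax, rdx
    sub     rax, r8             ; RAX = elapsed reference cycles for the whole loop

Repeating that block and keeping the minimum filters out most of the interrupts and context switches that land inside a single run.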
Kenavo
---------------------------
Grincheux / Jokaste

aw27

I don't quite understand what your point is, but have a look here:
http://masm32.com/board/index.php?topic=49.0

Jokaste

I am looking for ideas on how to measure CPU cycles. I am getting very strange results.
Kenavo
---------------------------
Grincheux / Jokaste

aw27

Code alignment does not matter; it may even be counterproductive.
What counts are the cache, branch prediction and pipelining.
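
As an illustration of the branch-prediction part (my example, not aw27's): when the outcome of a comparison is unpredictable, a conditional move avoids the misprediction penalty entirely.

; branchy form: a misprediction costs many cycles when the data is random
        cmp     rax, rdx
        jle     @F              ; skip if RAX <= RDX
        mov     rax, rdx        ; RAX = min(RAX, RDX)
@@:

; branchless form: no prediction involved, the latency is fixed
        cmp     rax, rdx
        cmovg   rax, rdx        ; RAX = min(RAX, RDX)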

Jokaste

Where could I find the instruction list ordered by pipeline? I did not see it in the AMD and Intel documentation.
Kenavo
---------------------------
Grincheux / Jokaste

aw27

Quote from: Jokaste on November 09, 2017, 05:37:58 AM
Where could I find the instruction list ordered by pipeline? I did not see it in the AMD and Intel documentation.
You will not find it; nowadays things change from processor to processor. What I said is that code alignment does not matter and can even be counterproductive.

hutch--

Much the same comment: aligning code on modern hardware more often than not ends up slower. Aligning data is another issue altogether; it is critical in many instances, and you align at minimum to the size of the data being read or written. Clock cycles are a leftover from ancient hardware: if you had a 386DX or earlier you would count clock cycles, but when the 486DX introduced the first x86 pipeline, speed shifted from cycle counting to instruction scheduling. Later processors have multiple pipelines, so scheduling is even more important. Internally x86 is RISC, but it presents a CISC interface of x86/x64 instructions, and this gives you a preferred instruction set; this is why you stay away from the ancient 8088/86 instructions and use the simple fast ones.
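
As a small illustration of what "the simple fast ones" means in practice (my example, not from the thread): the legacy string and loop instructions are typically microcoded and slow on current cores, while the plain integer forms schedule well across the multiple pipelines.

; legacy 8086-style byte copy (LODSB/STOSB/LOOP are microcoded on most modern cores)
@@:
    lodsb                   ; AL = [RSI], RSI++
    stosb                   ; [RDI] = AL, RDI++
    loop    @B              ; RCX--, jump while RCX != 0

; the same copy with the preferred simple instructions
@@:
    mov     al, [rsi]
    mov     [rdi], al
    add     rsi, 1
    add     rdi, 1
    sub     rcx, 1
    jnz     SHORT @B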

Speed Testing !!!!
The only one that matters is REAL TIME[tm] Testing. Cycle counts are a waste of time.
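
For what real-time testing looks like in practice, here is a minimal sketch (my own; the procedure name and iteration count are placeholders, and the MASM32/64 SDK timing macros do the same thing more carefully). It assumes GetTickCount from kernel32 is declared, e.g. with extern GetTickCount:PROC, and that kernel32.lib is linked.

bench PROC
    push    rbx
    sub     rsp, 32             ; shadow space for the API calls
    call    GetTickCount        ; EAX = milliseconds since boot
    mov     ebx, eax            ; save the start time

    mov     r10d, 100000000     ; run the code under test enough times to
@@: xor     rcx, rcx            ; make the elapsed milliseconds meaningful
    dec     r10d
    jnz     SHORT @B

    call    GetTickCount
    sub     eax, ebx            ; EAX = elapsed milliseconds for the whole run
    add     rsp, 32
    pop     rbx
    ret
bench ENDP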

Jokaste

We have no choice about aligning data; with some instructions an unaligned access simply crashes.
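
As an illustration of that crash (my sketch, not code from the thread): the aligned SSE loads fault on a misaligned address, so the data itself has to be placed on the boundary.

.data
        align 16
vec1    REAL4 1.0, 2.0, 3.0, 4.0         ; placed on a 16 byte boundary
vec2    REAL4 5.0, 6.0, 7.0, 8.0         ; 16 bytes later, so also aligned

.code
        movaps  xmm0, xmmword ptr vec1   ; aligned load: faults if vec1 is misaligned
        movups  xmm1, xmmword ptr vec2   ; unaligned load: always legal, may cost a little
        addps   xmm0, xmm1               ; xmm0 = element-wise sum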
I remember the time when I would pick one instruction to execute in the U pipe and another in the V pipe.
It seems there is little left for us to do to optimize code, except prefetching and branch prediction.
That is not much.
If someone could write a good post on this it would help me; the Intel and AMD manuals are not easy to understand, even with a translation program.
Thanks Hutch
Kenavo
---------------------------
Grincheux / Jokaste

felipe

Nice info hutch.
:icon14: :icon14:

hutch--

It's not an easy subject to address in a small space, but assembler instructions (mnemonics) roughly break up into 3 classes: the antique slow ones that are implemented in slow microcode; the RISC style of simple integer instructions (add, sub, test etc.) that are used for general purpose code layout; and the much faster SSE, AVX and a few other classes of later instructions that handle the messy stuff you cannot do with the former.

Over time Intel has shifted processing power more and more towards the later go-fast stuff, as the general purpose instructions are mainly used for the addressing that has to be done to feed the later, much larger instructions. On a PIV, SSE was not all that fast, but on the later Core2 and then the sequence of i7 processors, SSE, AVX and the other later types got a lot faster as more silicon was dedicated to their operation. There are a few special cases like REP MOVSQ that are fast enough, but the shift keeps going in the direction of the later, much faster stuff.

In particular, the older ideas of optimisation matter less and less with the later instructions, as the action is in getting the instructions scheduled in an efficient order so that the much larger data sizes are stacked efficiently. Instruction count and size matter little these days because the world has changed: with 16 bit real mode DOS, 1 meg was a big deal; Win3 was a weird hybrid in a couple of forms due to very limited memory and a 16 meg address range; Win32 has a theoretical 4 gig range of which you could normally use only 2 gig; and Win64 can address up to about 128 gig due to limitations in the size of the page table controlled by the OS.

The rough distinction here is that the cute tricks of DOS .com files simply do not matter in Win32/64; using an extra meg simply does not matter in the range of multiple gigabytes. If you can get a performance gain by using more memory, expanding a technique into a table, or any of a number of conventional tricks, do it, as the size does not matter. When you have to work on BYTE data, as many traditional algorithms must, you can still write fast code if you do the normal things: minimum instruction counts, careful choice of instructions and good algorithm design.
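
One common form of the memory-for-speed trade described above (my illustration, not from the thread) is a lookup table that replaces a compare-and-branch per byte with a single indexed load.

.data
hexchars db "0123456789ABCDEF"       ; the table replaces the classic cmp/add-7 branch

.code
; convert the low nibble of AL into its ASCII hex digit
        movzx   edx, al
        and     edx, 0Fh
        lea     r8,  hexchars        ; RIP-relative addressing cannot take an index,
        mov     al,  [r8+rdx]        ; so load the table address into a register first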

felipe

OK, so it seems that aligning data on more recent hardware (Intel microprocessors) is still a good way to optimize code, but aligning labels for jumps or procedures is not.

Please correct me if i'm wrong.  :biggrin:

Jokaste

I have looked at many programs (even in the Windows directory), and many of them align code on a 16 byte boundary. If they do that, there is a reason. The Intel and AMD support files also say to align code and data. I will continue the research. Thanks for your answers. :t
Kenavo
---------------------------
Grincheux / Jokaste

aw27

Quote from: Jokaste on November 16, 2017, 06:17:47 PM
I have looked at many programs (even in the Windows directory), and many of them align code on a 16 byte boundary. If they do that, there is a reason. The Intel and AMD support files also say to align code and data. I will continue the research. Thanks for your answers. :t
I hope you will come up with some code that supports your beliefs.  :eusa_naughty: