
RDTSC

Started by Jokaste, November 08, 2017, 08:43:45 AM


Jokaste

Hello everyone!

I would like to run a test with RDTSC.
I created a small function which only resets RCX.
I tested a non-aligned version of the function and a version aligned on a 16-byte boundary.
The results are very strange: sometimes the aligned version is quicker, and other times it is slower!
Do you have an explanation, or a better way to run the tests?
Thanks

Quote

RDTSC
XOR RCX,RCX
--------------------------------------------------------------------
---------------> NOT ALIGNED

RDX:RAX      = 00002E8F13E974B7   00002EDF94E3C13A   00002EE3DFE7A9B2   00002EE70A0A7F63   00002EEAF8FCC41A
R9:R8      = 0002E8E8B356408   00002EDECC1D4E3F   00002EE37163E41F   00002EE704692472   00002EEA7BF407D8
RAX - R8   = 2 293 502 127      3 368 448 763         1 854 129 555         94 460 657      2 097 724 482
LOOP = 10 000 000


RDTSC
mov   r8,rax
mov   r9,rdx
mov   eax,10000000   ; load the counter AFTER RDTSC, which overwrites EAX/EDX
@Loop:
XOR rcx,rcx
dec eax
jnz SHORT @Loop
RDTSC
shl   rdx,32
or   rax,rdx
shl   r9,32
or   r8,r9
nop
---------------> 16 BYTES ALIGNED
RDX:RAX= 00002E80EB676582 00002EC9E986FD8F 00002ECE11AE236D 00002ED2778F7E13 00002ED60C1B7824
R9:R8      = 00002E8076A8B087   00002EC8F57DF4B3   00002ECE0779193E   00002ED1BC1E10CE   00002ED58727E171
RAX - R8   = 1 958 655 227      4 094 232 796         171 248 175         3 144 772 933      2 230 556 339
LOOP =  10 000 000
--------------------------------------------------------------------
RDTSC
mov   r8,rax
mov   r9,rdx
mov   eax,10000000   ; load the counter AFTER RDTSC, which overwrites EAX/EDX
jmp SHORT @Loop
ALIGN 16
@Loop:
XOR rcx,rcx
dec eax
jnz SHORT @Loop
RDTSC
shl   rdx,32
or   rax,rdx
shl   r9,32
or   r8,r9
nop
Kenavo
---------------------------
Grincheux / Jokaste

aw27

I don't quite understand what your point is, but have a look here:
http://masm32.com/board/index.php?topic=49.0

Jokaste

I am looking for ideas on how to measure CPU cycles. I am getting very strange results.
Kenavo
---------------------------
Grincheux / Jokaste

aw27

Code alignment does not matter; it may even be counterproductive.
What counts is the cache, branch prediction and pipelining.

Jokaste

Where could I find a list of instructions ordered by pipeline? I did not see one from AMD or Intel.
Kenavo
---------------------------
Grincheux / Jokaste

aw27

Quote from: Jokaste on November 09, 2017, 05:37:58 AM
Where could I find a list of instructions ordered by pipeline? I did not see one from AMD or Intel.
You will not find them; nowadays things change from processor to processor. What I said is that code alignment does not matter and can even be counterproductive.

hutch--

Much the same comment: aligning code on modern hardware more often than not ends up slower. Aligning data is another issue altogether; it is critical in many instances, and at a minimum you align to the size of the data to be read or written. Clock cycles are a leftover from ancient hardware. If you had a 386DX or earlier, you would count clock cycles; when the 486DX introduced the first x86 pipeline, the emphasis shifted from cycle counting to instruction scheduling. Later processors have multiple pipelines, so scheduling is even more important. Internally, x86 is RISC, but it presents a CISC interface of x86/x64 instructions, and this gives you a preferred instruction set; this is why you stay away from the ancient 8088/86 instructions and use the simple, fast ones.

Speed Testing !!!!
The only testing that matters is REAL TIME[tm] testing. Cycle counts are a waste of time.

Jokaste

We have no choice about aligning data, because using some instructions on unaligned data results in a crash.
I remember the time when I would select one instruction to execute in the U pipe and another in the V pipe.
It seems there is little left for us to do to optimize code, apart from prefetching and branch prediction.
That is not much.
If someone could write a good post on this, it would help me; the Intel and AMD manuals are not easy to understand, even using a translation program.
Thanks, Hutch
Kenavo
---------------------------
Grincheux / Jokaste

felipe

Nice info hutch.
:icon14: :icon14:

hutch--

It's not an easy subject to address in a small space, but assembler instructions (mnemonics) roughly break up into three classes: the antique slow ones that are implemented in slow microcode; the RISC style of simple integer instructions (add, sub, test, etc.) that are used for general-purpose code layout and the messy stuff you cannot do otherwise; and the much faster SSE, AVX and a few other classes of later instructions.

Over time Intel has shifted processing power more and more towards the later go-fast stuff, as the general-purpose instructions are mainly used for addressing, which has to be done to feed the later, much larger data sizes. On a PIV, SSE was not all that fast, but on the later Core2 and then the sequence of i7 processors, SSE, AVX and the other later types got a lot faster as more silicon was dedicated to their operation. There are a few special cases like REP MOVSQ that are fast enough, but the shift keeps going in the direction of the later, much faster stuff.

In particular, the older ideas of optimisation matter less and less with the later instructions, as the action is in getting the instructions scheduled in an efficient order so that the much larger data sizes are handled efficiently. Instruction count and size matter little these days, as the world has changed: with 16-bit real-mode DOS, 1 meg was a big deal; Win3 was a weird hybrid in a couple of forms due to very limited memory and a 16 meg address range; Win32 has a theoretical 4 gig range of which you could normally use only 2 gig; and Win64 can address up to about 128 gig due to limitations in the size of the page table controlled by the OS.

The rough distinction here is that the cute tricks used in DOS .com files simply do not matter in Win32/64; using an extra meg does not matter in the range of multiple gigabytes. If you can get a performance gain by using more memory, expanding a technique into a table, or any of a number of conventional tricks, do it, as the size does not matter. When you have to work on BYTE data, as many traditional algorithms must, you can still write fast code if you do the normal things: minimum instruction count, careful choice of instructions and good algorithm design.

felipe

Ok, so it seems that aligning data on more recent hardware (Intel microprocessors) is still a good way to optimize code, but aligning labels for jumps or procedures is not.

Please correct me if I'm wrong.  :biggrin:

Jokaste

I have looked at many programs (even in the Windows directory), and many of them align code on a 16-byte boundary. If they do that, there is a reason. The Intel and AMD support files also say to align code and data. I will continue my research. Thanks for your answers. :t
Kenavo
---------------------------
Grincheux / Jokaste

aw27

Quote from: Jokaste on November 16, 2017, 06:17:47 PM
I have looked at many programs (even in the Windows directory), and many of them align code on a 16-byte boundary. If they do that, there is a reason. The Intel and AMD support files also say to align code and data. I will continue my research. Thanks for your answers. :t
I hope you will come up with some code that supports your beliefs.  :eusa_naughty: