Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
394 cycles for 100 * xlat
458 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
386 cycles for 100 * xlat
457 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
387 cycles for 100 * xlat
457 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
387 cycles for 100 * xlat
458 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
align_64
NameA equ <xlat> ; assign a descriptive name for each test
TestA proc
mov ebx, offset somestring
push 99
.Repeat
xor ecx, ecx ; 256
align 4
.Repeat
mov eax, ecx
xlat ; mov al,[ebx+al]
dec ecx
.Until Sign?
dec stack
.Until Sign?
pop edx
ret
TestA endp
align_64
NameB equ <movzx eax, byte ptr[ebx+ecx]>
TestB proc
mov ebx, offset somestring
push 99
.Repeat
xor ecx, ecx ; 256
align 4
.Repeat
movzx eax, byte ptr[ebx+ecx] ; mov al,[ebx+al]
dec ecx
.Until Sign?
dec stack
.Until Sign?
pop edx
ret
TestB endp
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
500 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
500 cycles for 100 * xlat
481 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
500 cycles for 100 * xlat
479 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
501 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Looks like a mixed bag. A little slower for my machine.
Been a little while since the last algo speed tests
:thumbsup:
But for Test A it should be xor eax, eax
...
dec eax
without the mov eax, ecx, no?
Quote from: HSE on September 19, 2022, 08:28:21 AM
:thumbsup:
But for Test A it should be xor eax, eax
...
dec eax
without the mov eax, ecx, no?
xlat changes al. So you can't use eax as the loop counter.
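Just to illustrate the point, here is a minimal sketch (my own, not the test code) of what HSE's version would do:

xor eax, eax       ; eax would be both the counter and the xlat index
next_byte:
xlat               ; al = [ebx+al] - the counter's low byte is replaced by the table entry
dec eax            ; now decrements a corrupted value
jns next_byte      ; the loop no longer steps through the indices as intended

That's why the test keeps the counter in ecx and pays for the mov eax, ecx.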
:biggrin: :biggrin: Sorry.
Still, that mov eax, ecx makes the comparison unfair.
Quote from: HSE on September 19, 2022, 10:11:53 AM
Still, that mov eax, ecx makes the comparison unfair.
:tongue:
Quote from: HSE on September 19, 2022, 10:11:53 AM
:biggrin: :biggrin: Sorry.
Still, that mov eax, ecx makes the comparison unfair.
Propose a better solution :thumbsup:
Quote from: jj2007 on September 19, 2022, 10:40:55 AM
Propose a better solution :thumbsup:
:thumbsup: Next week.
Perhaps you can't evaluate xlat out of context.
AMD Athlon(tm) II X2 220 Processor (SSE3)
452 cycles for 100 * xlat
460 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
453 cycles for 100 * xlat
458 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
451 cycles for 100 * xlat
457 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
670 cycles for 100 * xlat
459 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
385 cycles for 100 * xlat
363 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
385 cycles for 100 * xlat
367 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
385 cycles for 100 * xlat
366 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
387 cycles for 100 * xlat
365 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
This is the second window that popped up. I don't have one of the Xeons turned on at the moment.
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
403 cycles for 100 * xlat
398 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
403 cycles for 100 * xlat
397 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
405 cycles for 100 * xlat
397 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
403 cycles for 100 * xlat
395 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Hi,
Three systems, two runs each.
F:\TEMP\TEST>xlattimi
pre-P4 (SSE1)
439 cycles for 100 * xlat
412 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
439 cycles for 100 * xlat
412 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
439 cycles for 100 * xlat
412 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
439 cycles for 100 * xlat
412 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
pre-P4 (SSE1)
440 cycles for 100 * xlat
415 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
444 cycles for 100 * xlat
413 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
440 cycles for 100 * xlat
424 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
441 cycles for 100 * xlat
413 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
510 cycles for 100 * xlat
298 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
508 cycles for 100 * xlat
295 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
511 cycles for 100 * xlat
309 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
503 cycles for 100 * xlat
294 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
510 cycles for 100 * xlat
297 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
509 cycles for 100 * xlat
301 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
509 cycles for 100 * xlat
296 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
510 cycles for 100 * xlat
305 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
488 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
489 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
490 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
488 cycles for 100 * xlat
479 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
490 cycles for 100 * xlat
486 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
490 cycles for 100 * xlat
481 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
486 cycles for 100 * xlat
483 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
491 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Regards,
Steve
Thanks, interesting :rolleyes:
Quote from: daydreamer on April 10, 1975, 08:52:04 PM
I am curious about the performance of a LUT sine/cosine: xlat with bytes vs. other approaches with a REAL4 LUT.
No, Magnus. Xlat's input is a byte and its output is also a byte; you can't retrieve a floating-point number.
Quote from: HSE on September 20, 2022, 04:12:47 AM
Quote from: daydreamer on April 10, 1975, 08:52:04 PM
I am curious about the performance of a LUT sine/cosine: xlat with bytes vs. other approaches with a REAL4 LUT.
No, Magnus. Xlat's input is a byte and its output is also a byte; you can't retrieve a floating-point number.
True, but mov eax, [ebx+4*ecx] is equally fast and would work for REAL4
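A minimal sketch of that idea, assuming a hypothetical 256-entry REAL4 table (SinTab is an invented name, not part of the test code):

.data
SinTab REAL4 256 dup(0.0)     ; hypothetical lookup table, filled elsewhere
.code
mov ebx, offset SinTab
movzx ecx, cl                 ; byte index 0..255, same role as the xlat input
mov eax, [ebx+4*ecx]          ; fetch the REAL4 as a 32-bit value
movd xmm0, eax                ; or load it directly: movss xmm0, [ebx+4*ecx]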
Quote from: jj2007 on September 19, 2022, 10:40:55 AM
Propose a better solution :thumbsup:
Tested 3 contexts for xlat operations:
function_under_glass11 macro
mov rax, rcx
xlat ; mov al,[ebx+al]
endm
function_under_glass12 macro
movzx eax, byte ptr[rbx+rcx] ; mov al,[ebx+al]
endm
function_under_glass21 macro
xlat ; mov al,[ebx+al]
endm
function_under_glass22 macro
movzx eax, byte ptr[rbx+rax] ; mov al,[ebx+al]
endm
function_under_glass31 macro
movzx eax, byte ptr [rdi]
xlat ; mov al,[ebx+al]
endm
function_under_glass32 macro
movzx eax, byte ptr[rdi]
movzx eax, byte ptr[rbx+rax] ; mov al,[ebx+al]
endm
measured_loop1 proc uses rbx loops:qword
local i : qword
mov rbx, offset somestring
ForLp i, 0, loops, rcx
function_under_glass12
Next i
ret
measured_loop1 endp
measured_loop2 proc uses rbx loops:qword
local i : qword
mov rbx, offset somestring
ForLp i, 0, loops, rax
function_under_glass22
Next i
ret
measured_loop2 endp
measured_loop3 proc uses rdi rbx loops:qword
local i : qword
mov rdi, offset CfgTokenChars
mov rbx, offset somestring
ForLp i, 0, loops
function_under_glass31
inc rdi
Next i
ret
measured_loop3 endp
There are always variations, and the mean differences are less than half a cycle.
Interesting: here xlat looks faster in WoW64 than directly in hardware.
In WoW64:
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
377 cycles for 100 * xlat
404 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
377 cycles for 100 * xlat
406 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
371 cycles for 100 * xlat
404 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
377 cycles for 100 * xlat
405 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
13 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
In UEFI:
code    ops       overhead
11      41.46     5.3930
12      40.82     5.1589
21      34.0781   5.5607
22      41.0515   5.1583
31     -30.22     6.9508
32     -23.11     7.4392
The last test suggests a curvature, perhaps related to cache behaviour?
xlat is one of the old instructions that still performs well. From memory, direct register algos are faster, but only by a small amount, and xlat is convenient enough to use. Another factor is the hardware it is running on, where you may get variations between machines.
Quote from: HSE on September 28, 2022, 01:28:13 AM
Interesting: here xlat looks faster in WoW64 than directly in hardware.
What makes you think that the 32-bit code is not running directly in hardware...?
Quote from: jj2007 on September 28, 2022, 02:20:37 AM
What makes you think that the 32-bit code is not running directly in hardware...?
In hardware you can't use 32 bit addresses.
Quote from: HSE on September 28, 2022, 03:12:53 AM
Quote from: jj2007 on September 28, 2022, 02:20:37 AM
What makes you think that the 32-bit code is not running directly in hardware...?
In hardware you can't use 32 bit addresses.
Sure you can, it's called "compatibility mode", and it's not an emulation. It's hardware.
Quote from: jj2007 on September 28, 2022, 04:24:08 AM
Sure you can, it's called "compatibility mode", and it's not an emulation. It's hardware.
In Windows you don't access memory directly, but through "virtual addresses". So in some sense everything is an emulation.
Proof is that, just by accident, I forgot to change your 32-bit addresses, and the program jumped nowhere :biggrin: :biggrin:
I know kernels and drivers can access some 32-bit addresses (in the BIOS, I think).
It has been the case since antiquity that only the OS has direct access to memory; it is via the OS that an instance handle in one app may be at the same virtual address as in any other app but not at the same physical address.
In long mode, all addresses must be 64 bit (anything that does not comply squawks /LARGEADDRESSAWARE errors). Compatibility mode gets you all of the extra complexity of 64 bit but is strangled at 2 gig, so I wonder why anyone bothers. I think it had something to do with Microsoft porting some of their older 32-bit apps to 64 bit.
Quote from: hutch-- on September 28, 2022, 10:11:01 AM
It has been the case since antiquity that only the OS has direct access to memory; it is via the OS that an instance handle in one app may be at the same virtual address as in any other app but not at the same physical address.
:thumbsup: I read something about 1958, or so.
Quote from: hutch-- on September 28, 2022, 10:11:01 AM
In long mode, all addresses must be 64 bit (anything that does not comply squawks /LARGEADDRESSAWARE errors). Compatibility mode gets you all of the extra complexity of 64 bit but is strangled at 2 gig, so I wonder why anyone bothers.
Exactly. There is no compatibility mode in hardware. On 64-bit machines there are only 64-bit addresses; compatibility mode is mostly an OS trick.
Debuggers do the inverse trick, and that can make you think compatibility mode happens in hardware :biggrin:
Quote from: hutch-- on September 28, 2022, 10:11:01 AM
I think it had something to do with Microsoft porting some of their older 32-bit apps to 64 bit.
Legacy support involves the whole hardware and software industry. It's not just MS; apparently some opcodes are maintained in processor designs only to facilitate legacy software support (and that completes the compatibility trick).
One thing I am a fan of is keeping the complete mnemonic set in later hardware. Whenever things like dropping some instructions get done, they often whack out some of the useful ones. You cannot use PUSHAD and POPAD in 64-bit, which is unfortunate, as they can be useful when debugging an algorithm.
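For 64-bit debugging, a macro pair can stand in for them; a minimal sketch (invented names, volatile registers only, and note that the seven pushes change the 16-byte stack alignment):

PushVols macro      ; rough 64-bit stand-in for PUSHAD
  push rax
  push rcx
  push rdx
  push r8
  push r9
  push r10
  push r11
endm

PopVols macro       ; rough 64-bit stand-in for POPAD
  pop r11
  pop r10
  pop r9
  pop r8
  pop rdx
  pop rcx
  pop rax
endm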
Quote from: HSE on September 28, 2022, 06:44:59 AM
In Windows you don't access memory directly, but through "virtual addresses". So in some sense everything is an emulation.
Proof is that, just by accident, I forgot to change your 32-bit addresses, and the program jumped nowhere :biggrin: :biggrin:
I have no idea what you mean by that second sentence. But I believe the first one is just plain wrong.
So you're saying that if I code MOV EAX, SomeMemoryAddress, I'm not actually executing the X86 MOV instruction, moving all 32 bits of data (with all the MOD/R/M stuff)? I don't believe that.
Now of course when we do access memory like that, that memory is virtual, and it might turn out that it's in a page that's been paged out to virtual storage (disk). In that case, a page fault is generated and the OS swaps that page back into memory, so we can physically access that data. (In other words, MOV does exactly what the Intel or AMD manual tells us it does.)
Or am I wrong about this? This is my understanding of how things work.
If you're right, think how incredibly S-L-O-O-O-W the whole process would be. Can you say "emulator"?
x86: mov eax, [ebx]
x64: mov rax, [rbx]
In both cases, ebx/rbx is a pointer to a virtual address. The MMU, a highly integrated part of the CPU, translates that address to a physical address.
Claiming that "all addresses are 64 bit" is misleading. In compatibility mode, the MMU certainly translates a 32-bit address to a physical one. Unfortunately, I have not been able to find a doc that clearly says whether that physical address is in the section below 4GB, or whether it's possible that mov eax, [ebx] talks to a physical address above that area. If that is the case, the MMU will still hand over a 32-bit address to the CPU in compatibility mode. Btw the name is misleading: "32-bit mode" would be simpler and more correct. The CPU has more than one mode, full stop.
In any case, it's hardware, not an emulation. An emulator is quite a different animal. Steem (https://sourceforge.net/projects/steemsse/), for example, translates...
MOVE.W MODE(A6),D7
BTST #0,D7
BEQ.S SCR_ENDE
BSR.S PPL
BTST #1,D7
BEQ.S SCR_ENDE
BSR.S PPL
BSR.S PPL
... into something that can be understood by an Intel or AMD CPU. Below is an example of a BASIC program that firmly believes it's running on a 68000 CPU.
Yet another story is WoW64 (https://en.wikipedia.org/wiki/WoW64): instead of running a complete parallel 32-bit version of Windows, all 32-bit processes that need to access the core Windows APIs go through a "gate" that translates their 32-bit handles etc. to 64-bit equivalents and then calls the native 64-bit DLLs. That process is so fast that you won't find any significant performance difference between 32- and 64-bit processes. In contrast, an emulator is typically a factor of 10-20 slower than native code.
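If you want to check from inside a 32-bit process whether it is going through that gate, IsWow64Process tells you; a minimal sketch, error handling omitted:

.data?
bWow64 dd ?
.code
invoke GetCurrentProcess
invoke IsWow64Process, eax, addr bWow64   ; sets bWow64 to TRUE under WoW64
.if bWow64 != 0
  ; 32-bit process on 64-bit Windows: API calls pass through the WoW64 gate
.endif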
Quote from: NoCforMe on September 28, 2022, 02:06:08 PM
Or am I wrong about this? This is my understanding of how things work.
I don't know. I think it's close to what JJ says: the OS stores a base address somewhere and performs a hardware function.
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
In both cases, ebx/rbx is a pointer to a virtual address. The MMU, a highly integrated part of the CPU, translates that address to a physical address.
I think that is correct if you are running a program inside an OS. I'm searching, from time to time, for how to do that without an OS.
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
In compatibility mode,
CPU modes must not be confused with OS modes. UEFI puts the CPU in 64-bit mode, and once the CPU is running in 64-bit mode there is no way back (it's a little different with a Legacy BIOS, but this machine boots from UEFI because the integrated graphics card requires that :biggrin:).
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
I have not been able to find a doc that clearly says whether that physical address is in the section below 4GB
I think the User Virtual Address Space in Windows is 2GB by default and the System Virtual Address Space is also 2GB; in Linux the split is optionally 1/3, 2/2 or 3/1. That is for 32 bits. For 64 bits it is 128TB (not sure about this last one).
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
In any case, it's hardware, not an emulation.
It's mostly hardware, but not exactly your code :thumbsup:.
Quote from: HSE on September 29, 2022, 12:13:30 AM
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
I have not been able to find a doc that clearly says whether that physical address is in the section below 4GB
I think the User Virtual Address Space in Windows is 2GB by default and the System Virtual Address Space is also 2GB; in Linux the split is optionally 1/3, 2/2 or 3/1. That is for 32 bits. For 64 bits it is 128TB (not sure about this last one).
The question is actually what happens if 16GB RAM are installed: can the MMU hand over a physical address above 4GB to a 32-bit process, obviously translated to a 32-bit pointer? My suspicion is that it can serve only the 0...4GB range, but I can't find evidence on the web.
Some of this stuff sounds like it comes out of Alice In Wonderland.
Look up the details of a 32-bit PE (portable executable) file and you will find the expression "Relative Virtual Address", which tells you that the ancient form of direct memory access (early MS-DOS) is not available in "Protected Mode".
Protected mode means the OS controls the entire memory space, including the memory that you allocate, and it's why you get access violations when you try to write past the end of allocated memory.
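That protection is easy to demonstrate; a minimal sketch (my own, using VirtualAlloc because it hands out whole pages, so the fault comes right at the page boundary):

invoke VirtualAlloc, NULL, 4096, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE
mov esi, eax                      ; base of one committed page
mov byte ptr [esi+4095], 1        ; fine: last byte of the page
mov byte ptr [esi+4096], 1        ; access violation: first byte past the allocation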
EVERY APP (no exceptions) runs in its own memory space, which is allocated and controlled by the OS, and there is no leakage across applications due to this isolation. The technique for sharing data between apps is memory-mapped files, which the OS provides for that purpose.
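A minimal sketch of that technique, with an invented mapping name and no error checking; a second app that opens the same name sees the same bytes:

.data
szMapName db "Local\MySharedBlock",0   ; hypothetical name, must match in both apps
.data?
hMap    dd ?
pShared dd ?
.code
invoke CreateFileMapping, INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, 4096, addr szMapName
mov hMap, eax
invoke MapViewOfFile, hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0
mov pShared, eax                       ; pointer to the shared 4096-byte block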
The ability to run both 32- and 64-bit apps is built into both the hardware and the OS. Both can run under a 64-bit OS version because the OS supports it. You cannot run 16-bit apps natively, as the OS no longer supports them.
There is an ugly hybrid where you can run some 32-bit code in a 64-bit PE file, but it comes at the cost of a 2-gig memory limit, so you get the worst of both worlds: 64-bit complexity and 32-bit memory limitations.
Go back to 16-bit Windows before true hardware multitasking and you had the joys of writing a perfect app, only to have it trashed by some piece of chyte that trashed something in the OS and crashed Windows. Protected mode and true memory isolation are a blessing. :thumbsup:
:thumbsup:
Quote from: hutch-- on September 29, 2022, 12:59:33 AM
Some of this stuff sounds like it comes out of Alice In Wonderland.
Your rant is perfectly unrelated to the questions raised, but if it makes you happy to praise the 64-bit world, so be it :thumbsup:
And worse, some of the claims sound like they were spoken by the Mad Hatter.
PE specs are clear cut. OS versions change over time and have different characteristics; the last OS version to support both 16- and 32-bit code was XP (from memory), and 64-bit Win7 and up do not support 16-bit code natively. On 64-bit OS versions, 32-bit code is supported by both the hardware and the OS, and protected mode makes it all possible. :biggrin:
PS: XLAT works fine and will keep most people happy most of the time.
So I'm pretty sure the correct answer is a combination of several opinions offered above:
- Hutch's assertion that addresses given to ASM opcodes are, indeed, virtual addresses is correct. That's how the paging scheme I described works; if a virtual address doesn't point to actual, physical memory, the OS loads the page in question from "backing store" (disk file). This, of course, is a simplification of the process, but it's basically how it works. (That's why page faults are a good thing in this case!)
- Other than that (address translation), the opcodes we specify in our ASM source code are actually, physically executed. At some point an actual MOV operation has to actually move something between physical memory and a register (or two registers, or two memory locations using DMA*.)
Now that second point ignores the whole thing of microcode, which are the actual, for real, bona fide hardware operations that take place when we ask for, say, a REP STOSB. We can think of our opcodes as functions, and microcode is the actual (hardware) code inside those functions. But since nobody (that I know of) has ever seen this code, much less written it, it's only of academic interest. (Can't be read or written, so far as I know. At least not in the code stream.)
Speaking of which, does anyone know where one can find X86 microcode? I'm not sure if this is (Intel/AMD, etc.) proprietary information or not. I am curious to see how it works. And apparently it can be uploaded via BIOS.
* I may be thinking of the Olde Tymes, when the 2-parameter string instructions (MOVSB) used actual DMA to do the data transfer. Probably not true anymore with this newfangled hardware ...
Opcodes are opcodes, and protected mode has no reason to interfere with them; in fact, opcodes work directly on OS-provided memory. Protected mode is literally OS-controlled memory, which ensures that one app cannot write to memory in another app like it could in Win3.?, where some piece of crap trashed the entire OS.
Win 3.? was in fact one single app that emulated multitasking in software, and it was genuinely clever stuff, but with the advent of hardware multitasking its great limitation was bypassed and the hardware provided the capacity to fully isolate each app in its own memory space. Much of what you are paying for when you buy a Windows version is this capacity to reliably run multiple apps, and that is apart from all of the rest of the facilities in an OS.
Now, instruction encodings are another matter altogether. In the days of the 8088, the whole instruction set was directly encoded in silicon, but as the instruction set became much larger on much later hardware, x86 CISC became an interface to a RISC-based instruction set that looked nothing like the old stuff. Each iteration of Intel (and probably AMD) hardware did much the same, where you had preferred instructions (the simple ones) and the older, more complex instructions that were dumped into much slower microcode.
The simple ones were much faster, and if you needed something from the older instructions it was available, but you tried to avoid them for performance reasons. There are some special cases, like the instructions that take a REP prefix: while MOVSD is as slow as a wet week by itself, REP MOVSD is special-case circuitry that is competitively fast on modern hardware.
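As a rough illustration, a REP MOVSD block copy looks like this (invented buffer names; assumes the byte count is a multiple of 4):

cld                            ; copy forward
mov esi, offset SrcBuffer      ; hypothetical source
mov edi, offset DstBuffer      ; hypothetical destination
mov ecx, BUFFER_BYTES / 4      ; dword count
rep movsd                      ; the special-case circuitry does the rest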
Could we go back to timing my word LUT instead of this debate?
http://masm32.com/board/index.php?topic=7938.msg103017#msg103017
Magnus,
While I am pleased to see you writing code, this topic was originally about timing the XLAT instruction. It wandered due to a number of technical issues, but it was never about the LUT that you are investigating.
From my brand new laptop:
Intel(R) Celeron(R) N5105 @ 2.00GHz (SSE4)
417 cycles for 100 * xlat
352 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
422 cycles for 100 * xlat
352 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
426 cycles for 100 * xlat
357 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
435 cycles for 100 * xlat
352 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
From an older computer, in a previous post here:
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
500 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
500 cycles for 100 * xlat
481 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
500 cycles for 100 * xlat
479 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
501 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Just running a few performance tests to check new vs. old
sorry for bumping an older thread. :skrewy:
Hi zedd, testing the Celeron with its smaller cache vs a 'normal' CPU would be interesting.