xlat is pretty fast

Started by jj2007, September 19, 2022, 08:11:42 AM


jj2007

Quote from: hutch-- on September 29, 2022, 12:59:33 AM
Some of this stuff sounds like it comes out of Alice In Wonderland.

Your rant is perfectly unrelated to the questions raised, but if it makes you happy to praise the 64-bit world, so be it :thumbsup:

hutch--

And worse, some of the claims sound like they were spoken by the mad hatter.

PE specs are clear cut. OS versions change over time and have different characteristics: the last OS version to support both 16 and 32 bit was XP (from memory), and Win7 64 and up do not support 16 bit code natively. On 64 bit OS versions, 32 bit code is supported in both hardware and the OS, and protected mode makes it all possible.  :biggrin:

PS: XLAT works fine and will keep most people happy most of the time.

NoCforMe

So I'm pretty sure the correct answer is a combination of several opinions offered above:

  • Hutch's assertion that addresses given to ASM opcodes are, indeed, virtual addresses is correct. That's how the paging scheme I described works; if a virtual address doesn't point to actual, physical memory, the OS loads the page in question from "backing store" (disk file). This, of course, is a simplification of the process, but it's basically how it works. (That's why page faults are a good thing in this case!)
  • Other than that (address translation), the opcodes we specify in our ASM source code are actually, physically executed. At some point an actual MOV operation has to actually move something between physical memory and a register (or two registers, or two memory locations using DMA*.)
Now that second point ignores the whole matter of microcode: the actual, for real, bona fide hardware operations that take place when we ask for, say, a REP STOSB. We can think of our opcodes as functions, and microcode is the actual (hardware) code inside those functions. But since nobody (that I know of) has ever seen this code, much less written it, it's only of academic interest. (Can't be read or written, so far as I know. At least not in the code stream.)
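
To make the address-translation point a little more concrete, here is a minimal C sketch of how a 32-bit virtual address is carved up under classic x86 paging with 4 KiB pages (no PAE): 10 bits of page-directory index, 10 bits of page-table index and a 12-bit offset. The names `VaParts` and `split_virtual_address` are made up for illustration, and the sketch only shows the bit split, not the page-fault handling or the backing store.

```c
#include <stdint.h>

/* Hypothetical helper type: the three fields of a 32-bit virtual address
   under classic (non-PAE) x86 paging with 4 KiB pages. */
typedef struct {
    uint32_t dir_index;    /* bits 31..22: index into the page directory */
    uint32_t table_index;  /* bits 21..12: index into the page table     */
    uint32_t offset;       /* bits 11..0 : byte offset inside the page   */
} VaParts;

/* Split a virtual address into the fields the MMU uses for the lookup. */
static VaParts split_virtual_address(uint32_t va)
{
    VaParts p;
    p.dir_index   = va >> 22;
    p.table_index = (va >> 12) & 0x3FFu;
    p.offset      = va & 0xFFFu;
    return p;
}
```

For example, `split_virtual_address(0x00401234u)` gives directory index 1, table index 1 and offset 0x234. If the page-table entry selected this way is not marked present, the CPU raises a page fault and the OS gets its chance to bring the page in.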

Speaking of which, does anyone know where one can find X86 microcode? I'm not sure if this is (Intel/AMD, etc.) proprietary information or not. I am curious to see how it works. And apparently it can be uploaded via BIOS.

* I may be thinking of the Olde Tymes, when the 2-parameter string instructions (MOVSB) used actual DMA to do the data transfer. Probably not true anymore with this newfangled hardware ...
Assembly language programming should be fun. That's why I do it.

hutch--

Opcodes are opcodes and protected mode has no reason to interfere with this and in fact opcodes work directly on OS provided memory. Protected mode is literally OS controlled memory which ensures that one app cannot write to memory in another app like it used to be in Win3.? where some piece of crap trashed the entire OS.

Win 3.? was in fact one single app that emulated multitasking in software and it was genuinely clever stuff but with the advent of hardware multitasking, its great limitation was bypassed and the hardware provided the capacity to fully isolate each app in its own memory space. Much of what you are paying for when you buy a Windows version is this capacity to reliably run multiple apps and this is apart from all of the rest of the facilities in an OS.

Now instruction encodings are another matter altogether. In the days of 8088 CPUs, the whole instruction set was directly encoded in silicon but as the instruction set became much larger on much later hardware, x86 CISC was an interface to a RISC based instruction set which looked nothing like the old stuff. Each iteration of Intel (and probably AMD) hardware did much the same where you had preferred instructions (the simple ones) and the older more complex instructions that were dumped in much slower microcode.

The simple ones were much faster and if you needed something from the older instructions, it was available but you tried to avoid them for performance reasons. There are some special cases like the instructions that take a REP prefix. While MOVSD is as slow as a wet week by itself, REP MOVSD is special case circuitry that is competitively fast on modern hardware.
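
As a rough illustration of what REP MOVSD does architecturally, here is a small C model, assuming the direction flag is clear. The `StringRegs` struct and the function names are invented for this sketch; it mirrors the register-level effect only and says nothing about the special-case circuitry that makes the REP form fast.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the registers the string instructions use. */
typedef struct {
    const uint32_t *esi;  /* source pointer       */
    uint32_t       *edi;  /* destination pointer  */
    size_t          ecx;  /* remaining dword count */
} StringRegs;

/* One MOVSD with DF = 0: copy a dword, then advance both pointers by 4 bytes. */
static void movsd_once(StringRegs *r)
{
    *r->edi = *r->esi;
    r->esi += 1;
    r->edi += 1;
}

/* REP MOVSD: repeat MOVSD until ECX reaches zero, decrementing ECX each time. */
static void rep_movsd(StringRegs *r)
{
    while (r->ecx != 0) {
        movsd_once(r);
        r->ecx -= 1;
    }
}
```

After `rep_movsd` returns, ECX is zero and ESI/EDI point just past the copied data, which is the same state the real instruction leaves behind.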

daydreamer

Could we go back to timing my word LUT instead of this debate?
http://masm32.com/board/index.php?topic=7938.msg103017#msg103017
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

hutch--

magnus,

While I am pleased to see you writing code, this topic was originally about timing the XLAT instruction. It wandered due to a number of technical issues but it was not what you had in mind with the LUT that you are investigating.

zedd151

From my brand new laptop:
Intel(R) Celeron(R) N5105 @ 2.00GHz (SSE4)

417     cycles for 100 * xlat
352     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

422     cycles for 100 * xlat
352     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

426     cycles for 100 * xlat
357     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

435     cycles for 100 * xlat
352     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---
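
For readers who want to see what the two timed instructions compute, here is a hedged C model. XLAT replaces AL with the byte at [EBX + AL] and leaves the upper 24 bits of EAX alone, while `movzx eax, byte ptr[ebx+ecx]` loads the byte and zero-extends it into the whole of EAX. The function names and the table are made up for illustration; the actual table and index used in the test program are not shown in this thread.

```c
#include <stdint.h>

/* Model of XLAT: AL = table[AL]; bits 31..8 of EAX are preserved.
   'table' plays the role of EBX and must hold 256 entries. */
static uint32_t xlat_model(const uint8_t *table, uint32_t eax)
{
    uint8_t al = (uint8_t)(eax & 0xFFu);
    return (eax & 0xFFFFFF00u) | table[al];
}

/* Model of movzx eax, byte ptr [ebx+ecx]: EAX = zero-extended table[ECX]. */
static uint32_t movzx_model(const uint8_t *table, uint32_t ecx)
{
    return (uint32_t)table[ecx];
}
```

Both models give the same value whenever the upper bits of EAX are zero before the XLAT, which matches the identical `72 = eax` lines in the output above.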

From older computer in previous post here:
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (SSE4)

500     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

500     cycles for 100 * xlat
481     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

500     cycles for 100 * xlat
479     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

501     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

Just running a few performance tests to check new vs. old.
Sorry for bumping an older thread.   :skrewy:
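
To compare the two machines at a glance, the four runs from each output above can be averaged. This is just a small C sketch over the numbers posted here; the array and function names are made up, and the figures are cycles per 100 instructions exactly as the test program reports them.

```c
#include <stddef.h>

/* Cycle counts copied from the two outputs above (cycles per 100 instructions). */
static const int celeron_xlat[4]  = {417, 422, 426, 435};
static const int celeron_movzx[4] = {352, 352, 357, 352};
static const int core2_xlat[4]    = {500, 500, 500, 501};
static const int core2_movzx[4]   = {480, 481, 479, 480};

/* Arithmetic mean of n samples; returns 0.0 for an empty set. */
static double mean_cycles(const int *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += samples[i];
    }
    return n ? sum / (double)n : 0.0;
}
```

The means come out at 425 vs. 353.25 on the Celeron N5105 and 500.25 vs. 480 on the Core 2 Duo E8400, so xlat is roughly 20% slower than the movzx form on the new laptop and about 4% slower on the older machine, at least in this particular test.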
Windows ten is the best.  :azn:

daydreamer

Hi zedd, testing a Celeron with its smaller cache against a 'normal' CPU would be interesting.