xlat is pretty fast

HSE · September 28, 2022, 01:28:13 AM

Quote from: jj2007 on September 19, 2022, 10:40:55 AM
Propose a better solution

Tested 3 contexts for xlat operations:

function_under_glass11 macro
	mov rax, rcx
	xlat		; al = byte ptr [rbx+al]
endm
function_under_glass12 macro
	movzx eax, byte ptr [rbx+rcx]	; eax = zero-extended byte at [rbx+rcx]
endm

function_under_glass21 macro
	xlat		; al = byte ptr [rbx+al]
endm
function_under_glass22 macro
	movzx eax, byte ptr [rbx+rax]	; eax = zero-extended byte at [rbx+rax]
endm

function_under_glass31 macro
	movzx eax, byte ptr [rdi]
	xlat		; al = byte ptr [rbx+al]
endm
function_under_glass32 macro
	movzx eax, byte ptr [rdi]
	movzx eax, byte ptr [rbx+rax]	; eax = zero-extended byte at [rbx+rax]
endm
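
The two variants in each pair are not strictly equivalent: xlat indexes the table with AL only and leaves the upper bits of RAX untouched, while movzx uses the whole index register and zero-extends the result. A small Python model of that difference follows, as a sketch only. The function names and the sample table are made up for illustration and are not part of the test program.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def xlat(rax: int, table: bytes) -> int:
    """Model of 64-bit xlat: AL = table[AL]; bits 8..63 of RAX are preserved."""
    al = rax & 0xFF                      # only the low byte is used as index
    return (rax & ~0xFF & MASK64) | table[al]

def movzx_lookup(index: int, table: bytes) -> int:
    """Model of movzx eax, byte ptr [rbx+index]: the result is zero-extended.

    Real hardware would read past the table for a large index; this model
    raises IndexError instead.
    """
    return table[index]

# Hypothetical 256-byte table where entry i holds 255 - i.
table = bytes(255 - i for i in range(256))

print(hex(xlat(0x12345605, table)))      # upper bits kept, low byte replaced
print(hex(movzx_lookup(5, table)))       # only the looked-up byte remains
```

If the loop counter or index ever exceeds 255, the two forms read different table entries, so the comparison is only like-for-like while the index fits in a byte.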



measured_loop1 proc uses rbx loops:qword
    local i : qword
    mov rbx, offset somestring
    ForLp i, 0, loops, rcx
        function_under_glass12
    Next i
    ret
measured_loop1 endp

measured_loop2 proc uses rbx loops:qword
    local i : qword
    mov rbx, offset somestring
    ForLp i, 0, loops, rax
        function_under_glass22
    Next i
    ret
measured_loop2 endp

measured_loop3 proc uses rbx rdi loops:qword
    local i : qword
    mov rdi, offset CfgTokenChars
    mov rbx, offset somestring
    ForLp i, 0, loops
        function_under_glass31
        inc rdi
    Next i
    ret
measured_loop3 endp

There is always some variation, and the mean differences are less than half a cycle.

Interestingly, xlat looks faster here under WoW64 than directly on hardware.

In WoW64:

Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)

377     cycles for 100 * xlat
404     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

377     cycles for 100 * xlat
406     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

371     cycles for 100 * xlat
404     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

377     cycles for 100 * xlat
405     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
13      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]
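
To put the WoW64 figures above on a per-instruction basis, the four runs can be averaged and divided by the repeat count. This is only a quick arithmetic sketch; it assumes that "100 *" means each timing covers 100 repetitions, and the variable names are invented here.

```python
from statistics import mean

REPEATS = 100  # assumption: each reported timing covers 100 repetitions

# Cycle counts copied from the four WoW64 runs above.
xlat_runs = [377, 377, 371, 377]
movzx_runs = [404, 406, 404, 405]

def cycles_per_op(runs, repeats=REPEATS):
    """Mean cycles per single operation across all runs."""
    return mean(runs) / repeats

xlat_cpo = cycles_per_op(xlat_runs)
movzx_cpo = cycles_per_op(movzx_runs)
diff = movzx_cpo - xlat_cpo

print(f"xlat : {xlat_cpo:.4f} cycles/op")
print(f"movzx: {movzx_cpo:.4f} cycles/op")
print(f"diff : {diff:.4f} cycles/op")
```

Under that assumption the gap is about 0.29 cycles per operation, which fits the "less than half a cycle" observation.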

In UEFI:

ops   overhead   code
11     41.46     5.3930
12     40.82     5.1589

21     34.0781   5.5607
22     41.0515   5.1583

31    -30.22     6.9508
32    -23.11     7.4392

The last test suggests a curvature, perhaps related to cache behavior?

hutch-- · September 28, 2022, 02:08:40 AM

xlat is one of the old instructions that still performs well. From memory, direct register algos are faster, but only by a small amount, and xlat is convenient enough to use. Another factor is the hardware it's running on, where you may get variations between them.

jj2007 · September 28, 2022, 02:20:37 AM

Quote from: HSE on September 28, 2022, 01:28:13 AM
Interestingly, xlat looks faster here under WoW64 than directly on hardware.

What makes you think that the 32-bit code is not running directly in hardware...?

HSE · September 28, 2022, 03:12:53 AM

Quote from: jj2007 on September 28, 2022, 02:20:37 AM
What makes you think that the 32-bit code is not running directly in hardware...?

In hardware you can't use 32 bit addresses.

jj2007 · September 28, 2022, 04:24:08 AM

Quote from: HSE on September 28, 2022, 03:12:53 AM
Quote from: jj2007 on September 28, 2022, 02:20:37 AM
What makes you think that the 32-bit code is not running directly in hardware...?

In hardware you can't use 32 bit addresses.

Sure you can, it's called "compatibility mode", and it's not an emulation. It's hardware.

HSE · September 28, 2022, 06:44:59 AM

Quote from: jj2007 on September 28, 2022, 04:24:08 AM
Sure you can, it's called "compatibility mode", and it's not an emulation. It's hardware.

In Windows you don't access memory directly, but through "Virtual addresses". Then everything is an emulation in some sense.

The proof is that, just by accident, I forgot to change your 32-bit addresses, and the program jumped to nowhere.

I know kernels and drivers can access some 32 bit addresses (in BIOS I think).

hutch-- · September 28, 2022, 10:11:01 AM

It has been the case since antiquity that only the OS has direct access to memory. It is via the OS that an instance handle in one app may be at the same virtual address as in any other app but not at the same physical address.

In long mode, all addresses must be 64 bit (anything that does not squawk /LARGEADDRESSAWARE errors). Compatibility mode gets you all of the extra complexity of 64 bit but is strangled at 2 gig, so I wonder why anyone bothers. I think it had something to do with Microsoft porting some of their older 32 bit apps to 64 bit.

HSE · September 28, 2022, 12:37:22 PM

Quote from: hutch-- on September 28, 2022, 10:11:01 AM
It has been the case since antiquity that only the OS has direct access to memory. It is via the OS that an instance handle in one app may be at the same virtual address as in any other app but not at the same physical address.

I read something about 1958, or so.

Quote from: hutch-- on September 28, 2022, 10:11:01 AM
In long mode, all addresses must be 64 bit (anything that does not squawk /LARGEADDRESSAWARE errors). Compatibility mode gets you all of the extra complexity of 64 bit but is strangled at 2 gig, so I wonder why anyone bothers.

Exactly. There is no compatibility mode in hardware. On 64-bit machines there are only 64-bit addresses; compatibility mode is mostly an OS trick.

Debuggers do the inverse trick, so you might think that compatibility mode happens in hardware.

Quote from: hutch-- on September 28, 2022, 10:11:01 AM
I think it had something to do with Microsoft porting some of their older 32 bit apps to 64 bit.

Legacy support involves the whole hardware and software industry. It is not just MS; apparently some opcodes are only maintained in processor designs to facilitate legacy software support (and that completes the compatibility trick).

hutch-- · September 28, 2022, 01:58:23 PM

One thing I am a fan of is keeping the complete mnemonic set in later hardware. Whenever things like dropping some instructions get done, they often whack out some of the useful ones. You cannot use PUSHAD and POPAD in 64 bit, which is unfortunate as they can be useful in debugging an algorithm.

NoCforMe · September 28, 2022, 02:06:08 PM

Quote from: HSE on September 28, 2022, 06:44:59 AM
In Windows you don't access memory directly, but through "Virtual addresses". Then everything is an emulation in some sense.

The proof is that, just by accident, I forgot to change your 32-bit addresses, and the program jumped to nowhere.

I have no idea what you mean by that second sentence. But I believe the first one is just plain wrong.

So you're saying that if I code MOV EAX, SomeMemoryAddress, that I'm not actually executing the X86 MOV instruction, moving all 32 bits of data (with all the MOD/R/M stuff)? I don't believe that.

Now of course when we do access memory like that, that memory is virtual, and it might turn out that it's in a page that's been paged out to virtual storage (disk). In that case, a page fault is generated and the OS swaps that page back into memory, so we can physically access that data. (In other words, MOV does exactly what the Intel or AMD manual tells us it does.)

Or am I wrong about this? This is my understanding of how things work.

If you're right, think how incredibly S-L-O-O-O-W the whole process would be. Can you say "emulator"?

jj2007 · September 28, 2022, 07:31:04 PM

x86: mov eax, [ebx]
x64: mov rax, [rbx]

In both cases, ebx/rbx is a pointer to a virtual address. The MMU, a highly integrated part of the CPU, translates that address to a physical address.

Claiming that "all addresses are 64 bit" is misleading. In compatibility mode, the MMU certainly translates a 32-bit address to a physical one. Unfortunately, I have not been able to find a doc that clearly says whether that physical address is in the section below 4GB, or whether it's possible that mov eax, [ebx] talks to a physical address above that area. If that is the case, the MMU will still hand over a 32-bit address to the CPU in compatibility mode. Btw the name is misleading: "32-bit mode" would be simpler and more correct. The CPU has more than one mode, full stop.

In any case, it's hardware, not an emulation. An emulator is quite a different animal. Steem, for example, translates...

               MOVE.W  MODE(A6),D7
               BTST    #0,D7
               BEQ.S   SCR_ENDE
               BSR.S   PPL
               BTST    #1,D7
               BEQ.S   SCR_ENDE
               BSR.S   PPL
               BSR.S   PPL

... into something that can be understood by an Intel or AMD CPU. Below an example of a BASIC program that firmly believes it's running on a 68000 CPU.

Yet another story is WoW64: Instead of running a complete parallel 32-bit version of Windows, all 32-bit processes that need to access the core Windows APIs go through a "gate" that translates their 32-bit handles etc. to 64-bit equivalents and then calls the native 64-bit DLLs. That process is so fast that you won't find any significant performance difference between 32- and 64-bit processes. In contrast, an emulator is typically a factor of 10-20 slower than native code.

HSE · September 29, 2022, 12:13:30 AM

Quote from: NoCforMe on September 28, 2022, 02:06:08 PM
Or am I wrong about this? This is my understanding of how things work.

I don't know. I think it is close to what JJ says. The OS stores a base address somewhere, and a hardware function does the rest.

Quote from: jj2007 on September 28, 2022, 07:31:04 PM
In both cases, ebx/rbx is a pointer to a virtual address. The MMU, a highly integrated part of the CPU, translates that address to a physical address.

I think that is correct if you are running a program inside an OS. I'm searching, from time to time, how to do that without an OS.

Quote from: jj2007 on September 28, 2022, 07:31:04 PM
In compatibility mode,

CPU modes must not be confused with OS modes. UEFI puts the CPU in 64-bit mode. Once the CPU is running in 64 bits, there is no way back (it is a little different with Legacy BIOS, but this machine runs from UEFI because the integrated graphics card requires that).

Quote from: jj2007 on September 28, 2022, 07:31:04 PM
I have not been able to find a doc that clearly says whether that physical address is in the section below 4GB

I think the user virtual address space in Windows is 2GB by default and the system virtual address space is also 2GB, and in Linux the split is optionally 1/3, 2/2 or 3/1. That is for 32 bits. For 64 bits it is 128TB (not sure about this last one).

Quote from: jj2007 on September 28, 2022, 07:31:04 PM
In any case, it's hardware, not an emulation.

It is mostly hardware, but not exactly your code.

jj2007 · September 29, 2022, 12:40:02 AM

Quote from: HSE on September 29, 2022, 12:13:30 AM
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
I have not been able to find a doc that clearly says whether that physical address is in the section below 4GB

I think the user virtual address space in Windows is 2GB by default and the system virtual address space is also 2GB, and in Linux the split is optionally 1/3, 2/2 or 3/1. That is for 32 bits. For 64 bits it is 128TB (not sure about this last one).

The question is actually what happens if 16GB RAM are installed: Can the MMU hand over a physical address above 4GB to a 32-bit process, obviously translated to a 32-bit pointer? My suspicion is that it can serve only the 0...4GB range, but I can't find evidence on the web.
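
For what it's worth, the Intel and AMD manuals describe compatibility mode as using the same long-mode page tables as 64-bit code, and page-table entries hold physical frame numbers wider than 32 bits. So the size of the pointer does not by itself limit where the physical page lives. The toy model below illustrates only that idea. It uses a single flat dictionary instead of the real multi-level tables, and the mapping in it is invented.

```python
PAGE_SIZE = 4096
GIB = 1 << 30

# Hypothetical page table: virtual page number -> physical frame number.
# The one entry maps virtual 0x00400000 to a frame at the 5 GiB mark.
page_table = {
    0x00400: (5 * GIB) // PAGE_SIZE,
}

def translate(vaddr: int) -> int:
    """Translate a 32-bit virtual address to a physical address (toy model)."""
    assert 0 <= vaddr < (1 << 32), "the process only ever sees 32-bit pointers"
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table[vpn]              # a KeyError here stands in for a page fault
    return frame * PAGE_SIZE + offset

paddr = translate(0x00400123)
print(hex(paddr), paddr >= 4 * GIB)
```

The process keeps using a 32-bit pointer throughout; only the translated result is wider, and the process never sees it.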

hutch-- · September 29, 2022, 12:59:33 AM

Some of this stuff sounds like it comes out of Alice In Wonderland.

Look up the details of a 32 bit PE (portable executable) file and you will find the expression "Relative Virtual Address" which tells you that the ancient form of direct memory access (early MS-DOS) is not available in "Protected Mode".

Protected mode means the OS controls the entire memory space, including the memory that you allocate, and it's why you get access violations when you try to write past the end of allocated memory.

EVERY APP (no exceptions) runs in its own memory space which is allocated and controlled by the OS and there is no leakage across applications due to this isolation. The technique to share data between apps is memory mapped files which the OS provides for that purpose.

It is both contained in the hardware and the OS to be able to run both 32 and 64 bit apps. They both can run in a 64 bit OS version because the OS supports it. You cannot run 16 bit apps natively as the OS no longer supports it.

There is an ugly hybrid where you can run some 32 bit code in a 64 bit PE file but it comes at a cost of a 2 gig memory limit so you get the worst of both worlds, 64 bit complexity and 32 bit memory limitations.

Go back to 16 bit Windows before true hardware multitasking and you had the joys of writing a perfect app, only to have it trashed by some piece of chyte that trashed something in the OS and crashed Windows. Protected mode and true memory isolation is a blessing.

HSE · September 29, 2022, 01:06:52 AM
