Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
394 cycles for 100 * xlat
458 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
386 cycles for 100 * xlat
457 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
387 cycles for 100 * xlat
457 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
387 cycles for 100 * xlat
458 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
align_64
NameA equ <xlat> ; assign a descriptive name for each test
TestA proc
mov ebx, offset somestring
push 99
.Repeat
xor ecx, ecx ; 256
align 4
.Repeat
mov eax, ecx
xlat ; mov al,[ebx+al]
dec ecx
.Until Sign?
dec stack
.Until Sign?
pop edx
ret
TestA endp
align_64
NameB equ <movzx eax, byte ptr[ebx+ecx]>
TestB proc
mov ebx, offset somestring
push 99
.Repeat
xor ecx, ecx ; 256
align 4
.Repeat
movzx eax, byte ptr[ebx+ecx] ; mov al,[ebx+al]
dec ecx
.Until Sign?
dec stack
.Until Sign?
pop edx
ret
TestB endp
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
500 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
500 cycles for 100 * xlat
481 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
500 cycles for 100 * xlat
479 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
501 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Looks like a mixed bag. A little slower for my machine.
Been a little while since the last algo speed tests
:thumbsup:
But for Test A it should be xor eax, eax
...
dec eax
without the mov eax, ecx, no?
Quote from: HSE on September 19, 2022, 08:28:21 AM
:thumbsup:
But for Test A it should be xor eax, eax
...
dec eax
without the mov eax, ecx, no?
xlat changes al. So you can't use eax as the loop counter.
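Just to illustrate the point, here is a minimal sketch (my own, not the test code) of what HSE's version would do:

xor eax, eax       ; eax would be both the counter and the xlat index
next_byte:
xlat               ; al = [ebx+al] - the counter's low byte is replaced by the table entry
dec eax            ; now decrements a corrupted value
jns next_byte      ; the loop no longer steps through the indices as intended

That's why the test keeps the counter in ecx and pays for the mov eax, ecx.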
:biggrin: :biggrin: Sorry.
Still, that mov eax, ecx makes the comparison unfair.
Quote from: HSE on September 19, 2022, 10:11:53 AM
Still, that mov eax, ecx makes the comparison unfair.
:tongue:
Quote from: HSE on September 19, 2022, 10:11:53 AM
:biggrin: :biggrin: Sorry.
Still, that mov eax, ecx makes the comparison unfair.
Propose a better solution :thumbsup:
Quote from: jj2007 on September 19, 2022, 10:40:55 AM
Propose a better solution :thumbsup:
:thumbsup: Next week.
Perhaps you can't evaluate xlat out of context.
AMD Athlon(tm) II X2 220 Processor (SSE3)
452 cycles for 100 * xlat
460 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
453 cycles for 100 * xlat
458 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
451 cycles for 100 * xlat
457 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
670 cycles for 100 * xlat
459 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
385 cycles for 100 * xlat
363 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
385 cycles for 100 * xlat
367 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
385 cycles for 100 * xlat
366 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
387 cycles for 100 * xlat
365 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
This is the second window that popped up. I don't have one of the Xeons turned on at the moment.
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
403 cycles for 100 * xlat
398 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
403 cycles for 100 * xlat
397 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
405 cycles for 100 * xlat
397 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
403 cycles for 100 * xlat
395 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Hi,
Three systems, two runs each.
F:\TEMP\TEST>xlattimi
pre-P4 (SSE1)
439 cycles for 100 * xlat
412 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
439 cycles for 100 * xlat
412 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
439 cycles for 100 * xlat
412 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
439 cycles for 100 * xlat
412 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
pre-P4 (SSE1)
440 cycles for 100 * xlat
415 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
444 cycles for 100 * xlat
413 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
440 cycles for 100 * xlat
424 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
441 cycles for 100 * xlat
413 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
510 cycles for 100 * xlat
298 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
508 cycles for 100 * xlat
295 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
511 cycles for 100 * xlat
309 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
503 cycles for 100 * xlat
294 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
510 cycles for 100 * xlat
297 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
509 cycles for 100 * xlat
301 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
509 cycles for 100 * xlat
296 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
510 cycles for 100 * xlat
305 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
488 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
489 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
490 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
488 cycles for 100 * xlat
479 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
490 cycles for 100 * xlat
486 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
490 cycles for 100 * xlat
481 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
486 cycles for 100 * xlat
483 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
491 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Regards,
Steve
Thanks, interesting :rolleyes:
Quote from: daydreamer on April 10, 1975, 08:52:04 PM
I am curious about the performance of a LUT sine/cosine: xlat with bytes vs. other approaches with a REAL4 LUT.
No, Magnus. Xlat's input is a byte and its output is also a byte; you can't retrieve a floating-point number.
Quote from: HSE on September 20, 2022, 04:12:47 AM
Quote from: daydreamer on April 10, 1975, 08:52:04 PM
I am curious about the performance of a LUT sine/cosine: xlat with bytes vs. other approaches with a REAL4 LUT.
No, Magnus. Xlat's input is a byte and its output is also a byte; you can't retrieve a floating-point number.
True, but mov eax, [ebx+4*ecx] is equally fast and would work for REAL4
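A minimal sketch of that idea, assuming a hypothetical 256-entry REAL4 table (SinTab is an invented name, not part of the test code):

.data
SinTab REAL4 256 dup(0.0)     ; hypothetical lookup table, filled elsewhere
.code
mov ebx, offset SinTab
movzx ecx, cl                 ; byte index 0..255, same role as the xlat input
mov eax, [ebx+4*ecx]          ; fetch the REAL4 as a 32-bit value
movd xmm0, eax                ; or load it directly: movss xmm0, [ebx+4*ecx]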
Quote from: jj2007 on September 19, 2022, 10:40:55 AM
Propose a better solution :thumbsup:
Tested 3 contexts for xlat operations:
function_under_glass11 macro
mov rax, rcx
xlat ; mov al,[ebx+al]
endm
function_under_glass12 macro
movzx eax, byte ptr[rbx+rcx] ; mov al,[ebx+al]
endm
function_under_glass21 macro
xlat ; mov al,[ebx+al]
endm
function_under_glass22 macro
movzx eax, byte ptr[rbx+rax] ; mov al,[ebx+al]
endm
function_under_glass31 macro
movzx eax, byte ptr [rdi]
xlat ; mov al,[ebx+al]
endm
function_under_glass32 macro
movzx eax, byte ptr[rdi]
movzx eax, byte ptr[rbx+rax] ; mov al,[ebx+al]
endm
measured_loop1 proc uses rbx loops:qword
local i : qword
mov rbx, offset somestring
ForLp i, 0, loops, rcx
function_under_glass12
Next i
ret
measured_loop1 endp
measured_loop2 proc uses rbx loops:qword
local i : qword
mov rbx, offset somestring
ForLp i, 0, loops, rax
function_under_glass22
Next i
ret
measured_loop2 endp
measured_loop3 proc uses rdi rbx loops:qword
local i : qword
mov rdi, offset CfgTokenChars
mov rbx, offset somestring
ForLp i, 0, loops
function_under_glass31
inc rdi
Next i
ret
measured_loop3 endp
There are always variations, and the mean differences are less than half a cycle.
Interesting: here xlat looks faster in WoW64 than directly in hardware.
In WoW64:
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
377 cycles for 100 * xlat
404 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
377 cycles for 100 * xlat
406 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
371 cycles for 100 * xlat
404 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
377 cycles for 100 * xlat
405 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
13 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
In UEFI:
code    ops       overhead
11      41.46     5.3930
12      40.82     5.1589
21      34.0781   5.5607
22      41.0515   5.1583
31     -30.22     6.9508
32     -23.11     7.4392
The last test suggests a curvature, perhaps related to cache behaviour?
xlat is one of the old instructions that still performs well. From memory, direct register algos are faster, but only by a small amount, and xlat is convenient enough to use. Another factor is the hardware it is running on, where you may get variations between machines.
Quote from: HSE on September 28, 2022, 01:28:13 AM
Interesting: here xlat looks faster in WoW64 than directly in hardware.
What makes you think that the 32-bit code is not running directly in hardware...?
Quote from: jj2007 on September 28, 2022, 02:20:37 AM
What makes you think that the 32-bit code is not running directly in hardware...?
In hardware you can't use 32 bit addresses.
Quote from: HSE on September 28, 2022, 03:12:53 AM
Quote from: jj2007 on September 28, 2022, 02:20:37 AM
What makes you think that the 32-bit code is not running directly in hardware...?
In hardware you can't use 32 bit addresses.
Sure you can, it's called "compatibility mode", and it's not an emulation. It's hardware.
Quote from: jj2007 on September 28, 2022, 04:24:08 AM
Sure you can, it's called "compatibility mode", and it's not an emulation. It's hardware.
In Windows you don't access memory directly, but through "virtual addresses". So in some sense everything is an emulation.
Proof is that, just by accident, I forgot to change your 32-bit addresses, and the program jumped nowhere :biggrin: :biggrin:
I know kernels and drivers can access some 32-bit addresses (in the BIOS, I think).
It has been the case since antiquity that only the OS has direct access to memory; it is via the OS that an instance handle in one app may be at the same virtual address as in any other app but not at the same physical address.
In long mode, all addresses must be 64 bit (anything that does not comply squawks /LARGEADDRESSAWARE errors). Compatibility mode gets you all of the extra complexity of 64 bit but is strangled at 2 gig, so I wonder why anyone bothers. I think it had something to do with Microsoft porting some of their older 32-bit apps to 64 bit.
Quote from: hutch-- on September 28, 2022, 10:11:01 AM
It has been the case since antiquity that only the OS has direct access to memory; it is via the OS that an instance handle in one app may be at the same virtual address as in any other app but not at the same physical address.
:thumbsup: I read something about 1958, or so.
Quote from: hutch-- on September 28, 2022, 10:11:01 AM
In long mode, all addresses must be 64 bit (anything that does not comply squawks /LARGEADDRESSAWARE errors). Compatibility mode gets you all of the extra complexity of 64 bit but is strangled at 2 gig, so I wonder why anyone bothers.
Exactly. There is no compatibility mode in hardware. On 64-bit machines there are only 64-bit addresses; compatibility mode is mostly an OS trick.
Debuggers do the inverse trick, and that can make you think compatibility mode happens in hardware :biggrin:
Quote from: hutch-- on September 28, 2022, 10:11:01 AM
I think it had something to do with Microsoft porting some of their older 32-bit apps to 64 bit.
Legacy support involves the whole hardware and software industry. It's not just MS; apparently some opcodes are maintained in processor designs only to facilitate legacy software support (and that completes the compatibility trick).
One thing I am a fan of is keeping the complete mnemonic set in later hardware. Whenever things like dropping some instructions get done, they often whack out some of the useful ones. You cannot use PUSHAD and POPAD in 64-bit, which is unfortunate, as they can be useful when debugging an algorithm.
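For 64-bit debugging, a macro pair can stand in for them; a minimal sketch (invented names, volatile registers only, and note that the seven pushes change the 16-byte stack alignment):

PushVols macro      ; rough 64-bit stand-in for PUSHAD
  push rax
  push rcx
  push rdx
  push r8
  push r9
  push r10
  push r11
endm

PopVols macro       ; rough 64-bit stand-in for POPAD
  pop r11
  pop r10
  pop r9
  pop r8
  pop rdx
  pop rcx
  pop rax
endm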
Quote from: HSE on September 28, 2022, 06:44:59 AM
In Windows you don't access memory directly, but through "virtual addresses". So in some sense everything is an emulation.
Proof is that, just by accident, I forgot to change your 32-bit addresses, and the program jumped nowhere :biggrin: :biggrin:
I have no idea what you mean by that second sentence. But I believe the first one is just plain wrong.
So you're saying that if I code MOV EAX, SomeMemoryAddress, I'm not actually executing the X86 MOV instruction, moving all 32 bits of data (with all the MOD/R/M stuff)? I don't believe that.
Now of course when we do access memory like that, that memory is virtual, and it might turn out that it's in a page that's been paged out to virtual storage (disk). In that case, a page fault is generated and the OS swaps that page back into memory, so we can physically access that data. (In other words, MOV does exactly what the Intel or AMD manual tells us it does.)
Or am I wrong about this? This is my understanding of how things work.
If you're right, think how incredibly S-L-O-O-O-W the whole process would be. Can you say "emulator"?
x86: mov eax, [ebx]
x64: mov rax, [rbx]
In both cases, ebx/rbx is a pointer to a virtual address. The MMU, a highly integrated part of the CPU, translates that address to a physical address.
Claiming that "all addresses are 64 bit" is misleading. In compatibility mode, the MMU certainly translates a 32-bit address to a physical one. Unfortunately, I have not been able to find a doc that clearly says whether that physical address is in the section below 4GB, or whether it's possible that mov eax, [ebx] talks to a physical address above that area. If that is the case, the MMU will still hand over a 32-bit address to the CPU in compatibility mode. Btw the name is misleading: "32-bit mode" would be simpler and more correct. The CPU has more than one mode, full stop.
In any case, it's hardware, not an emulation. An emulator is quite a different animal. Steem (https://sourceforge.net/projects/steemsse/), for example, translates...
MOVE.W MODE(A6),D7
BTST #0,D7
BEQ.S SCR_ENDE
BSR.S PPL
BTST #1,D7
BEQ.S SCR_ENDE
BSR.S PPL
BSR.S PPL
... into something that can be understood by an Intel or AMD CPU. Below is an example of a BASIC program that firmly believes it's running on a 68000 CPU.
Yet another story is WoW64 (https://en.wikipedia.org/wiki/WoW64): instead of running a complete parallel 32-bit version of Windows, all 32-bit processes that need to access the core Windows APIs go through a "gate" that translates their 32-bit handles etc. to 64-bit equivalents and then calls the native 64-bit DLLs. That process is so fast that you won't find any significant performance difference between 32- and 64-bit processes. In contrast, an emulator is typically a factor of 10-20 slower than native code.
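If you want to check from inside a 32-bit process whether it is going through that gate, IsWow64Process tells you; a minimal sketch, error handling omitted:

.data?
bWow64 dd ?
.code
invoke GetCurrentProcess
invoke IsWow64Process, eax, addr bWow64   ; sets bWow64 to TRUE under WoW64
.if bWow64 != 0
  ; 32-bit process on 64-bit Windows: API calls pass through the WoW64 gate
.endif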
Quote from: NoCforMe on September 28, 2022, 02:06:08 PM
Or am I wrong about this? This is my understanding of how things work.
I don't know. I think it's close to what JJ says: the OS stores a base address somewhere and performs a hardware function.
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
In both cases, ebx/rbx is a pointer to a virtual address. The MMU, a highly integrated part of the CPU, translates that address to a physical address.
I think that is correct if you are running a program inside an OS. I'm searching, from time to time, for how to do that without an OS.
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
In compatibility mode,
CPU modes must not be confused with OS modes. UEFI puts the CPU in 64-bit mode, and once the CPU is running in 64-bit mode there is no way back (it's a little different with a Legacy BIOS, but this machine boots from UEFI because the integrated graphics card requires that :biggrin:).
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
I have not been able to find a doc that clearly says whether that physical address is in the section below 4GB
I think the User Virtual Address Space in Windows is 2GB by default and the System Virtual Address Space is also 2GB; in Linux the split is optionally 1/3, 2/2 or 3/1. That is for 32 bits. For 64 bits it is 128TB (not sure about this last one).
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
In any case, it's hardware, not an emulation.
It's mostly hardware, but not exactly your code :thumbsup:.
Quote from: HSE on September 29, 2022, 12:13:30 AM
Quote from: jj2007 on September 28, 2022, 07:31:04 PM
I have not been able to find a doc that clearly says whether that physical address is in the section below 4GB
I think the User Virtual Address Space in Windows is 2GB by default and the System Virtual Address Space is also 2GB; in Linux the split is optionally 1/3, 2/2 or 3/1. That is for 32 bits. For 64 bits it is 128TB (not sure about this last one).
The question is actually what happens if 16GB RAM are installed: can the MMU hand over a physical address above 4GB to a 32-bit process, obviously translated to a 32-bit pointer? My suspicion is that it can serve only the 0...4GB range, but I can't find evidence on the web.
Some of this stuff sounds like it comes out of Alice In Wonderland.
Look up the details of a 32-bit PE (portable executable) file and you will find the expression "Relative Virtual Address", which tells you that the ancient form of direct memory access (early MS-DOS) is not available in "Protected Mode".
Protected mode means the OS controls the entire memory space, including the memory that you allocate, and it's why you get access violations when you try to write past the end of allocated memory.
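That protection is easy to demonstrate; a minimal sketch (my own, using VirtualAlloc because it hands out whole pages, so the fault comes right at the page boundary):

invoke VirtualAlloc, NULL, 4096, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE
mov esi, eax                      ; base of one committed page
mov byte ptr [esi+4095], 1        ; fine: last byte of the page
mov byte ptr [esi+4096], 1        ; access violation: first byte past the allocation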
EVERY APP (no exceptions) runs in its own memory space, which is allocated and controlled by the OS, and there is no leakage across applications due to this isolation. The technique for sharing data between apps is memory-mapped files, which the OS provides for that purpose.
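A minimal sketch of that technique, with an invented mapping name and no error checking; a second app that opens the same name sees the same bytes:

.data
szMapName db "Local\MySharedBlock",0   ; hypothetical name, must match in both apps
.data?
hMap    dd ?
pShared dd ?
.code
invoke CreateFileMapping, INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, 4096, addr szMapName
mov hMap, eax
invoke MapViewOfFile, hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0
mov pShared, eax                       ; pointer to the shared 4096-byte block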
The ability to run both 32- and 64-bit apps is built into both the hardware and the OS. Both can run under a 64-bit OS version because the OS supports it. You cannot run 16-bit apps natively, as the OS no longer supports them.
There is an ugly hybrid where you can run some 32-bit code in a 64-bit PE file, but it comes at the cost of a 2-gig memory limit, so you get the worst of both worlds: 64-bit complexity and 32-bit memory limitations.
Go back to 16-bit Windows before true hardware multitasking and you had the joys of writing a perfect app, only to have it trashed by some piece of chyte that trashed something in the OS and crashed Windows. Protected mode and true memory isolation are a blessing. :thumbsup:
:thumbsup:
Quote from: hutch-- on September 29, 2022, 12:59:33 AM
Some of this stuff sounds like it comes out of Alice In Wonderland.
Your rant is perfectly unrelated to the questions raised, but if it makes you happy to praise the 64-bit world, so be it :thumbsup:
And worse, some of the claims sound like they were spoken by the Mad Hatter.
PE specs are clear cut. OS versions change over time and have different characteristics; the last OS version to support both 16- and 32-bit code was XP (from memory), and 64-bit Win7 and up do not support 16-bit code natively. On 64-bit OS versions, 32-bit code is supported by both the hardware and the OS, and protected mode makes it all possible. :biggrin:
PS: XLAT works fine and will keep most people happy most of the time.
So I'm pretty sure the correct answer is a combination of several opinions offered above:
- Hutch's assertion that addresses given to ASM opcodes are, indeed, virtual addresses is correct. That's how the paging scheme I described works; if a virtual address doesn't point to actual, physical memory, the OS loads the page in question from "backing store" (disk file). This, of course, is a simplification of the process, but it's basically how it works. (That's why page faults are a good thing in this case!)
- Other than that (address translation), the opcodes we specify in our ASM source code are actually, physically executed. At some point an actual MOV operation has to actually move something between physical memory and a register (or two registers, or two memory locations using DMA*.)
Now that second point ignores the whole thing of microcode, which are the actual, for real, bona fide hardware operations that take place when we ask for, say, a REP STOSB. We can think of our opcodes as functions, and microcode is the actual (hardware) code inside those functions. But since nobody (that I know of) has ever seen this code, much less written it, it's only of academic interest. (Can't be read or written, so far as I know. At least not in the code stream.)
Speaking of which, does anyone know where one can find X86 microcode? I'm not sure if this is (Intel/AMD, etc.) proprietary information or not. I am curious to see how it works. And apparently it can be uploaded via BIOS.
* I may be thinking of the Olde Tymes, when the 2-parameter string instructions (MOVSB) used actual DMA to do the data transfer. Probably not true anymore with this newfangled hardware ...
Opcodes are opcodes, and protected mode has no reason to interfere with them; in fact, opcodes work directly on OS-provided memory. Protected mode is literally OS-controlled memory, which ensures that one app cannot write to memory in another app like it could in Win3.?, where some piece of crap trashed the entire OS.
Win 3.? was in fact one single app that emulated multitasking in software, and it was genuinely clever stuff, but with the advent of hardware multitasking its great limitation was bypassed and the hardware provided the capacity to fully isolate each app in its own memory space. Much of what you are paying for when you buy a Windows version is this capacity to reliably run multiple apps, and that is apart from all of the rest of the facilities in an OS.
Now, instruction encodings are another matter altogether. In the days of the 8088, the whole instruction set was directly encoded in silicon, but as the instruction set became much larger on much later hardware, x86 CISC became an interface to a RISC-based instruction set that looked nothing like the old stuff. Each iteration of Intel (and probably AMD) hardware did much the same, where you had preferred instructions (the simple ones) and the older, more complex instructions that were dumped into much slower microcode.
The simple ones were much faster, and if you needed something from the older instructions it was available, but you tried to avoid them for performance reasons. There are some special cases, like the instructions that take a REP prefix: while MOVSD is as slow as a wet week by itself, REP MOVSD is special-case circuitry that is competitively fast on modern hardware.
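As a rough illustration, a REP MOVSD block copy looks like this (invented buffer names; assumes the byte count is a multiple of 4):

cld                            ; copy forward
mov esi, offset SrcBuffer      ; hypothetical source
mov edi, offset DstBuffer      ; hypothetical destination
mov ecx, BUFFER_BYTES / 4      ; dword count
rep movsd                      ; the special-case circuitry does the rest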
Could we go back to timing my word LUT instead of this debate?
http://masm32.com/board/index.php?topic=7938.msg103017#msg103017
Magnus,
While I am pleased to see you writing code, this topic was originally about timing the XLAT instruction. It wandered due to a number of technical issues, but it was never about the LUT that you are investigating.
From my brand new laptop:
Intel(R) Celeron(R) N5105 @ 2.00GHz (SSE4)
417 cycles for 100 * xlat
352 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
422 cycles for 100 * xlat
352 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
426 cycles for 100 * xlat
357 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
435 cycles for 100 * xlat
352 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
From an older computer, in a previous post here:
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
500 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
500 cycles for 100 * xlat
481 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
500 cycles for 100 * xlat
479 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
501 cycles for 100 * xlat
480 cycles for 100 * movzx eax, byte ptr[ebx+ecx]
13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]
72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]
--- ok ---
Just running a few performance tests to check new vs. old
sorry for bumping an older thread. :skrewy:
Hi zedd, testing the Celeron with its smaller cache vs a 'normal' CPU would be interesting.