xlat is pretty fast

jj2007 · September 19, 2022, 08:11:42 AM

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

394     cycles for 100 * xlat
458     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

386     cycles for 100 * xlat
457     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

387     cycles for 100 * xlat
457     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

387     cycles for 100 * xlat
458     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

align_64
NameA equ xlat	; assign a descriptive name for each test
TestA proc
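  mov ebx, offset somestring	; ebx = table base for xlat
  push 99			; outer counter: 99 down to -1 = 100 passes
  .Repeat
	xor ecx, ecx		; ecx = 0, so the inner loop body runs once per pass
	align 4
	.Repeat
		mov eax, ecx	; xlat takes its index from al
		xlat		; mov al,[ebx+al]
		dec ecx
	.Until Sign?
	dec stack		; decrement the counter pushed above
  .Until Sign?
  pop edx			; remove the counter from the stack
  ret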
TestA endp

align_64
NameB equ movzx eax, byte ptr[ebx+ecx]
TestB proc
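  mov ebx, offset somestring	; ebx = table base
  push 99			; outer counter: 99 down to -1 = 100 passes
  .Repeat
	xor ecx, ecx		; ecx = 0, so the inner loop body runs once per pass
	align 4
	.Repeat
		movzx eax, byte ptr[ebx+ecx]		; eax = zero-extended byte at [ebx+ecx]
		dec ecx
	.Until Sign?
	dec stack		; decrement the counter pushed above
  .Until Sign?
  pop edx			; remove the counter from the stack
  ret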
TestB endp
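
For readers tracing the loop counts, here is a rough C model of the control flow shared by both test procs. It assumes that `stack` refers to the counter pushed by `push 99` and that `.Until Sign?` exits once the preceding `dec` produces a negative value. `count_lookups` is a made-up name for illustration, not part of the test bed.

```c
#include <assert.h>

/* Counts how often the loop body (xlat or movzx) executes.
   outer_start models "push 99", inner_start models the value left in ecx. */
static int count_lookups(int outer_start, int inner_start)
{
    assert(outer_start >= 0 && inner_start >= 0);
    int lookups = 0;
    int outer = outer_start;      /* push 99 */
    do {
        int ecx = inner_start;    /* xor ecx, ecx gives 0 */
        do {
            lookups++;            /* the instruction under test */
            ecx--;                /* dec ecx */
        } while (ecx >= 0);       /* .Until Sign? */
        outer--;                  /* dec stack */
    } while (outer >= 0);         /* .Until Sign? */
    return lookups;
}
```

With the values used in the procs, `count_lookups(99, 0)` gives 100, which matches the "100 *" in the timing lines.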

zedd151 · September 19, 2022, 08:19:55 AM

Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (SSE4)

500     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

500     cycles for 100 * xlat
481     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

500     cycles for 100 * xlat
479     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

501     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

Looks like a mixed bag. A little slower on my machine.
It's been a little while since the last algo speed tests.

HSE · September 19, 2022, 08:28:21 AM

But shouldn't Test A be

xor eax, eax
...
dec eax

without mov eax, ecx?

jj2007 · September 19, 2022, 09:54:12 AM

Quote from: HSE on September 19, 2022, 08:28:21 AM

But shouldn't Test A be

xor eax, eax
...
dec eax

without mov eax, ecx?

xlat changes al. So you can't use eax as the loop counter.
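
To make that concrete, here is a small C model of the two lookups. It is a sketch, not the actual test code, and `xlat_model` and `movzx_model` are hypothetical names. In 32-bit code xlat takes its index from al and writes the result back to al, leaving the upper 24 bits of eax alone, while movzx can take its index from any register and overwrites all of eax.

```c
#include <assert.h>
#include <stdint.h>

/* Model of xlat: al = table[al]; bits 8..31 of eax are preserved.
   The table pointer plays the role of ebx. */
static uint32_t xlat_model(uint32_t eax, const uint8_t *table)
{
    assert(table != 0);
    uint8_t al = (uint8_t)(eax & 0xFFu);
    return (eax & 0xFFFFFF00u) | table[al];
}

/* Model of movzx eax, byte ptr [ebx+ecx]: the index lives in ecx,
   and the byte is zero-extended into the whole of eax. */
static uint32_t movzx_model(const uint8_t *table, uint32_t ecx)
{
    assert(table != 0);
    return (uint32_t)table[ecx];
}
```

After `xlat_model` the low byte holds the table entry, so whatever count was in eax is gone. That is why TestA needs the extra mov eax, ecx.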

HSE · September 19, 2022, 10:11:53 AM

Sorry.

Yet, that mov eax, ecx makes the comparison unfair.

zedd151 · September 19, 2022, 10:27:52 AM

Quote from: HSE on September 19, 2022, 10:11:53 AM
Yet, that mov eax, ecx makes the comparison unfair.

jj2007 · September 19, 2022, 10:40:55 AM

Quote from: HSE on September 19, 2022, 10:11:53 AM
Sorry.

Yet, that mov eax, ecx makes the comparison unfair.

Propose a better solution

HSE · September 19, 2022, 08:28:50 PM

Quote from: jj2007 on September 19, 2022, 10:40:55 AM
Propose a better solution

Next week.

Perhaps you can't evaluate xlat out of context.

TimoVJL · September 19, 2022, 08:33:00 PM

AMD Athlon(tm) II X2 220 Processor (SSE3)

452     cycles for 100 * xlat
460     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

453     cycles for 100 * xlat
458     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

451     cycles for 100 * xlat
457     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

670     cycles for 100 * xlat
459     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

daydreamer · September 19, 2022, 10:40:43 PM

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

385     cycles for 100 * xlat
363     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

385     cycles for 100 * xlat
367     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

385     cycles for 100 * xlat
366     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

387     cycles for 100 * xlat
365     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

hutch-- · September 19, 2022, 11:34:30 PM

This is the second window that popped up. I don't have one of the Xeons turned on at the moment.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

403 cycles for 100 * xlat
398 cycles for 100 * movzx eax, byte ptr[ebx+ecx]

403 cycles for 100 * xlat
397 cycles for 100 * movzx eax, byte ptr[ebx+ecx]

405 cycles for 100 * xlat
397 cycles for 100 * movzx eax, byte ptr[ebx+ecx]

403 cycles for 100 * xlat
395 cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13 bytes for xlat
14 bytes for movzx eax, byte ptr[ebx+ecx]

72 = eax xlat
72 = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

FORTRANS · September 20, 2022, 12:35:59 AM

Hi,

Three systems, two runs each.

F:\TEMP\TEST>xlattimi
pre-P4 (SSE1)

439     cycles for 100 * xlat
412     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

439     cycles for 100 * xlat
412     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

439     cycles for 100 * xlat
412     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

439     cycles for 100 * xlat
412     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

pre-P4 (SSE1)

440	cycles for 100 * xlat
415	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

444	cycles for 100 * xlat
413	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

440	cycles for 100 * xlat
424	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

441	cycles for 100 * xlat
413	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13	bytes for xlat
14	bytes for movzx eax, byte ptr[ebx+ecx]

72	= eax xlat
72	= eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

510     cycles for 100 * xlat
298     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

508     cycles for 100 * xlat
295     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

511     cycles for 100 * xlat
309     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

503     cycles for 100 * xlat
294     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

510	cycles for 100 * xlat
297	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

509	cycles for 100 * xlat
301	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

509	cycles for 100 * xlat
296	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

510	cycles for 100 * xlat
305	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13	bytes for xlat
14	bytes for movzx eax, byte ptr[ebx+ecx]

72	= eax xlat
72	= eax movzx eax, byte ptr[ebx+ecx]

--- ok ---



Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

488     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

489     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

490     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

488     cycles for 100 * xlat
479     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

490	cycles for 100 * xlat
486	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

490	cycles for 100 * xlat
481	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

486	cycles for 100 * xlat
483	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

491	cycles for 100 * xlat
480	cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13	bytes for xlat
14	bytes for movzx eax, byte ptr[ebx+ecx]

72	= eax xlat
72	= eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

Regards,

Steve

jj2007 · September 20, 2022, 12:39:50 AM

Thanks, interesting

HSE · September 20, 2022, 04:12:47 AM

Quote from: daydreamer on April 10, 1975, 08:52:04 PM
I am curious about performance for LUT sine/cosine, xlat bytes vs. other approaches with a REAL4 LUT

No, Magnus. xlat's input is a byte and its output is also a byte. You can't retrieve a floating point number.

jj2007 · September 20, 2022, 05:08:14 AM

Quote from: HSE on September 20, 2022, 04:12:47 AM
Quote from: daydreamer on April 10, 1975, 08:52:04 PM
I am curious about performance for LUT sine/cosine, xlat bytes vs. other approaches with a REAL4 LUT

No, Magnus. xlat's input is a byte and its output is also a byte. You can't retrieve a floating point number.

True, but mov eax, [ebx+4*ecx] is equally fast and would work for REAL4
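
As a sketch of why the scaled index works for REAL4: each entry is 4 bytes, so element ecx sits at ebx+4*ecx, and the 32 bits loaded into eax are the float's bit pattern. The C model below assumes IEEE 754 single precision (as on x86). `load_dword` and `as_real4` are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of mov eax, [ebx+4*ecx]: load the raw 32 bits of element ecx. */
static uint32_t load_dword(const void *ebx, uint32_t ecx)
{
    uint32_t eax;
    assert(ebx != 0);
    memcpy(&eax, (const unsigned char *)ebx + 4u * ecx, sizeof eax);
    return eax;
}

/* Reinterpret those 32 bits as a REAL4 (float) value. */
static float as_real4(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

In assembly the value would more likely be loaded straight into the FPU or an XMM register (for example fld REAL4 ptr [ebx+4*ecx]) than routed through eax, but the addressing is the same.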

The MASM Forum
