xlat is pretty fast

Started by jj2007, September 19, 2022, 08:11:42 AM


jj2007

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

394     cycles for 100 * xlat
458     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

386     cycles for 100 * xlat
457     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

387     cycles for 100 * xlat
457     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

387     cycles for 100 * xlat
458     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]


align_64
NameA equ xlat ; assign a descriptive name for each test
TestA proc
  mov ebx, offset somestring
  push 99 ; outer counter on the stack: 100 passes
  .Repeat
    xor ecx, ecx ; ecx = 0, so the inner loop runs once per pass
    align 4
    .Repeat
      mov eax, ecx ; xlat takes its index from al
      xlat ; in effect: mov al, [ebx+al]
      dec ecx
    .Until Sign?
    dec stack
  .Until Sign?
  pop edx
  ret
TestA endp

align_64
NameB equ movzx eax, byte ptr[ebx+ecx]
TestB proc
  mov ebx, offset somestring
  push 99 ; outer counter on the stack: 100 passes
  .Repeat
    xor ecx, ecx ; ecx = 0, so the inner loop runs once per pass
    align 4
    .Repeat
      movzx eax, byte ptr[ebx+ecx] ; zero-extended byte load, index stays in ecx
      dec ecx
    .Until Sign?
    dec stack
  .Until Sign?
  pop edx
  ret
TestB endp
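
For readers who want to check the iteration count, here is a rough C model of the loop structure shared by TestA and TestB. It is only a sketch: `count_lookups` and its parameters are hypothetical names, the table contents stand in for `somestring` (not shown in the post), and `dec stack` is assumed to decrement the dword on top of the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the TestA/TestB loop nest.
   Returns how many table lookups are executed and stores the
   last loaded value in *last_eax. */
static unsigned count_lookups(const uint8_t *table, uint32_t *last_eax)
{
    assert(table != NULL && last_eax != NULL);

    unsigned lookups = 0;
    int32_t outer = 99;                 /* push 99 */
    do {
        int32_t ecx = 0;                /* xor ecx, ecx */
        do {
            *last_eax = table[ecx];     /* xlat or movzx eax, byte ptr[ebx+ecx] */
            lookups++;
            ecx--;                      /* dec ecx */
        } while (ecx >= 0);             /* .Until Sign? */
        outer--;                        /* dec stack (assumed: dec dword ptr [esp]) */
    } while (outer >= 0);               /* .Until Sign? */

    return lookups;
}
```

Under these assumptions the inner loop body runs once per outer pass, which gives 100 lookups in total and matches the "100 *" in the timing output. The reported value 72 is the ASCII code of 'H', which would fit a string starting with that letter.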

zedd151

Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (SSE4)

500     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

500     cycles for 100 * xlat
481     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

500     cycles for 100 * xlat
479     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

501     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---



Looks like a mixed bag: xlat is a little slower on my machine.
It's been a little while since the last algo speed tests.

HSE

 :thumbsup:

But shouldn't Test A use xor eax, eax
...
dec eax
without the mov eax, ecx? No?
Equations in Assembly: SmplMath

jj2007

Quote from: HSE on September 19, 2022, 08:28:21 AM
:thumbsup:

But shouldn't Test A use xor eax, eax
...
dec eax
without the mov eax, ecx? No?

xlat overwrites al, so you can't use eax as the loop counter.
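
A small C model may make the point concrete. This is a sketch of the documented register effects only, not of timing; `xlat_model` and `movzx_model` are hypothetical helper names, and the table is assumed to hold at least 256 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xlat: al = byte at [ebx + zero-extended al].
   Bits 8..31 of eax are left unchanged. */
static uint32_t xlat_model(uint32_t eax, const uint8_t *ebx_table)
{
    assert(ebx_table != NULL);
    uint8_t al = (uint8_t)(eax & 0xFFu);
    return (eax & 0xFFFFFF00u) | ebx_table[al];
}

/* movzx eax, byte ptr [ebx+ecx]: all of eax is replaced by the
   zero-extended byte, and the index register ecx is not modified. */
static uint32_t movzx_model(uint32_t ecx, const uint8_t *ebx_table)
{
    assert(ebx_table != NULL);
    return ebx_table[ecx];
}
```

With a table where entry 3 holds 252, a "counter" of 3 in eax comes back as 252 after xlat, so a following dec eax would no longer count down from 3. That is why the test keeps the counter in ecx and copies it into eax first.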

HSE

 :biggrin: :biggrin: Sorry.

Still, that mov eax, ecx makes the comparison unfair.

jj2007

Quote from: HSE on September 19, 2022, 10:11:53 AM
:biggrin: :biggrin: Sorry.

Still, that mov eax, ecx makes the comparison unfair.

Propose a better solution :thumbsup:

HSE

Quote from: jj2007 on September 19, 2022, 10:40:55 AM
Propose a better solution :thumbsup:

  :thumbsup: Next week.

Perhaps you can't evaluate xlat out of context.

TimoVJL

AMD Athlon(tm) II X2 220 Processor (SSE3)

452     cycles for 100 * xlat
460     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

453     cycles for 100 * xlat
458     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

451     cycles for 100 * xlat
457     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

670     cycles for 100 * xlat
459     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---
May the source be with you

daydreamer

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

385     cycles for 100 * xlat
363     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

385     cycles for 100 * xlat
367     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

385     cycles for 100 * xlat
366     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

387     cycles for 100 * xlat
365     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]



my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

hutch--

This is the second window that popped up. I don't have one of the Xeons turned on at the moment.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

403     cycles for 100 * xlat
398     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

403     cycles for 100 * xlat
397     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

405     cycles for 100 * xlat
397     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

403     cycles for 100 * xlat
395     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

FORTRANS

Hi,

   Three systems, two runs each.

F:\TEMP\TEST>xlattimi
pre-P4 (SSE1)

439     cycles for 100 * xlat
412     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

439     cycles for 100 * xlat
412     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

439     cycles for 100 * xlat
412     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

439     cycles for 100 * xlat
412     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

pre-P4 (SSE1)

440     cycles for 100 * xlat
415     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

444     cycles for 100 * xlat
413     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

440     cycles for 100 * xlat
424     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

441     cycles for 100 * xlat
413     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

510     cycles for 100 * xlat
298     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

508     cycles for 100 * xlat
295     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

511     cycles for 100 * xlat
309     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

503     cycles for 100 * xlat
294     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

510     cycles for 100 * xlat
297     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

509     cycles for 100 * xlat
301     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

509     cycles for 100 * xlat
296     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

510     cycles for 100 * xlat
305     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---



Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

488     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

489     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

490     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

488     cycles for 100 * xlat
479     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

490     cycles for 100 * xlat
486     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

490     cycles for 100 * xlat
481     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

486     cycles for 100 * xlat
483     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

491     cycles for 100 * xlat
480     cycles for 100 * movzx eax, byte ptr[ebx+ecx]

13      bytes for xlat
14      bytes for movzx eax, byte ptr[ebx+ecx]

72      = eax xlat
72      = eax movzx eax, byte ptr[ebx+ecx]

--- ok ---


Regards,

Steve

HSE

Quote from: daydreamer on April 10, 1975, 08:52:04 PM
I am curious about the performance of a sine/cosine LUT: xlat on bytes vs. other lookups with a REAL4 LUT

No, Magnus. xlat takes a byte as input and returns a byte, so you can't retrieve a floating-point number with it.

jj2007

Quote from: HSE on September 20, 2022, 04:12:47 AM
Quote from: daydreamer on April 10, 1975, 08:52:04 PM
I am curious about the performance of a sine/cosine LUT: xlat on bytes vs. other lookups with a REAL4 LUT

No, Magnus. xlat takes a byte as input and returns a byte, so you can't retrieve a floating-point number with it.

True, but mov eax, [ebx+4*ecx] is equally fast and would work for a REAL4 table.
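
To illustrate the scaled-index idea, here is a rough C sketch of what such a load does. It assumes `float` is a 32-bit IEEE 754 type (matching REAL4); `load_real4_bits` and `bits_to_real4` are hypothetical helper names, not anything from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models mov eax, [ebx+4*ecx]: the index is scaled by the element
   size (4 bytes) and the raw 32-bit pattern of the entry lands in eax. */
static uint32_t load_real4_bits(const float *ebx_table, uint32_t ecx)
{
    assert(ebx_table != NULL);
    uint32_t eax;
    memcpy(&eax, &ebx_table[ecx], sizeof eax);  /* bit copy, no conversion */
    return eax;
}

/* Reinterprets the loaded bits as a REAL4 again, e.g. before
   handing the value to the FPU or an SSE register. */
static float bits_to_real4(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

The index here can be a full 32-bit register and each entry is four bytes wide, which is exactly what xlat's byte-in, byte-out design cannot provide.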