ToUpper & ToLower timings

Started by jj2007, January 11, 2022, 01:06:27 PM

jj2007

This collection is almost 8 years old, but apparently it has never been tested in the Lab :cool:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+++17 of 20 tests valid, loop overhead is approx. 644/100 cycles

4074    kCycles for 100 * cmp al, new dest
1623    kCycles for 100 * table, new dest
3063    kCycles for 100 * cmp al, in place
1669    kCycles for 100 * xlat, new dest
5571    kCycles for 100 * LevelUp, in place

4528    kCycles for 100 * cmp al, new dest
1618    kCycles for 100 * table, new dest
3631    kCycles for 100 * cmp al, in place
2802    kCycles for 100 * xlat, new dest
5553    kCycles for 100 * LevelUp, in place

4569    kCycles for 100 * cmp al, new dest
1633    kCycles for 100 * table, new dest
3076    kCycles for 100 * cmp al, in place
1669    kCycles for 100 * xlat, new dest
5569    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


See here for similar stuff at the FreeBasic forum

LiaoMi

Hi  :tongue:,

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

1378    kCycles for 100 * cmp al, new dest
660     kCycles for 100 * table, new dest
1142    kCycles for 100 * cmp al, in place
960     kCycles for 100 * xlat, new dest
1395    kCycles for 100 * LevelUp, in place

1425    kCycles for 100 * cmp al, new dest
674     kCycles for 100 * table, new dest
1185    kCycles for 100 * cmp al, in place
1006    kCycles for 100 * xlat, new dest
1425    kCycles for 100 * LevelUp, in place

1365    kCycles for 100 * cmp al, new dest
671     kCycles for 100 * table, new dest
1159    kCycles for 100 * cmp al, in place
1012    kCycles for 100 * xlat, new dest
1472    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


--- ok ---

daydreamer

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
++++++14 of 20 tests valid, loop overhead is approx. 323/100 cycles

4207    kCycles for 100 * cmp al, new dest
1550    kCycles for 100 * table, new dest
1980    kCycles for 100 * cmp al, in place
1785    kCycles for 100 * xlat, new dest
3248    kCycles for 100 * LevelUp, in place

5186    kCycles for 100 * cmp al, new dest
1568    kCycles for 100 * table, new dest
2188    kCycles for 100 * cmp al, in place
1821    kCycles for 100 * xlat, new dest
3231    kCycles for 100 * LevelUp, in place

5047    kCycles for 100 * cmp al, new dest
1667    kCycles for 100 * table, new dest
1964    kCycles for 100 * cmp al, in place
1777    kCycles for 100 * xlat, new dest
3310    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


-
my non-asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
+19 of 20 tests valid, loop overhead is approx. 322/100 cycles

3438    kCycles for 100 * cmp al, new dest
1883    kCycles for 100 * table, new dest
3641    kCycles for 100 * cmp al, in place
2602    kCycles for 100 * xlat, new dest
2967    kCycles for 100 * LevelUp, in place

3917    kCycles for 100 * cmp al, new dest
1882    kCycles for 100 * table, new dest
3654    kCycles for 100 * cmp al, in place
2513    kCycles for 100 * xlat, new dest
2975    kCycles for 100 * LevelUp, in place

3839    kCycles for 100 * cmp al, new dest
1879    kCycles for 100 * table, new dest
3644    kCycles for 100 * cmp al, in place
2618    kCycles for 100 * xlat, new dest
2903    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place
May the source be with you

coaster

Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz (SSE4)
+++++15 of 20 tests valid, loop overhead is approx. 402/100 cycles

3153    kCycles for 100 * cmp al, new dest
1174    kCycles for 100 * table, new dest
1435    kCycles for 100 * cmp al, in place
1584    kCycles for 100 * xlat, new dest
3355    kCycles for 100 * LevelUp, in place

3770    kCycles for 100 * cmp al, new dest
1191    kCycles for 100 * table, new dest
1510    kCycles for 100 * cmp al, in place
1737    kCycles for 100 * xlat, new dest
4648    kCycles for 100 * LevelUp, in place

3725    kCycles for 100 * cmp al, new dest
1334    kCycles for 100 * table, new dest
1969    kCycles for 100 * cmp al, in place
2227    kCycles for 100 * xlat, new dest
4367    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


jj2007

And the winner is...

    xor ecx, ecx
    align 4
    .Repeat
        movzx eax, byte ptr [edx+ecx]      ; edx is source string
        movzx eax, byte ptr TheTable[eax]  ; mov al is equally fast on i5
        mov [edi+ecx], al                  ; edi is destination string
        inc ecx
    .Until ecx>=SrcBytes
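
For anyone following along outside MASM, the table loop above can be sketched in C. This is only an illustration under stated assumptions: the names the_table, build_upper_table and upper_table_copy are hypothetical, the thread does not show how TheTable is actually filled, and plain ASCII is assumed. A C compiler will not necessarily emit the same instructions as the hand-written loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for TheTable: one output byte per possible input byte. */
unsigned char the_table[256];

/* Assumed table layout: identity for every byte, except 'a'..'z' map to 'A'..'Z'. */
void build_upper_table(void)
{
    for (int i = 0; i < 256; i++)
        the_table[i] = (unsigned char)i;
    for (int c = 'a'; c <= 'z'; c++)
        the_table[c] = (unsigned char)(c - 'a' + 'A');
}

/* Copy src_bytes bytes from src to dest, translating each one through the table.
   There is no compare or branch per character, only an indexed load. */
void upper_table_copy(unsigned char *dest, const unsigned char *src, size_t src_bytes)
{
    assert(dest != NULL && src != NULL);
    for (size_t i = 0; i < src_bytes; i++)
        dest[i] = the_table[src[i]];
}
```

Call build_upper_table() once, then upper_table_copy(dest, src, length). One difference from the MASM version: the .Repeat loop always processes at least one byte, while this for loop also accepts a length of zero.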


I've added two more table-based solutions, both fast and 28 bytes short:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
++++16 of 20 tests valid, loop overhead is approx. 600/100 cycles

4062    kCycles for 100 * cmp al, new dest
1622    kCycles for 100 * table, new dest
3075    kCycles for 100 * cmp al, in place
1668    kCycles for 100 * xlat, new dest
1638    kCycles for 100 * table, source zero-delimited
1634    kCycles for 100 * table, source zero-delimited, stosb

4057    kCycles for 100 * cmp al, new dest
1617    kCycles for 100 * table, new dest
3058    kCycles for 100 * cmp al, in place
1666    kCycles for 100 * xlat, new dest
1620    kCycles for 100 * table, source zero-delimited
1623    kCycles for 100 * table, source zero-delimited, stosb

4089    kCycles for 100 * cmp al, new dest
1607    kCycles for 100 * table, new dest
3191    kCycles for 100 * cmp al, in place
1702    kCycles for 100 * xlat, new dest
1621    kCycles for 100 * table, source zero-delimited
1626    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

3494    kCycles for 100 * cmp al, new dest
1657    kCycles for 100 * table, new dest
2839    kCycles for 100 * cmp al, in place
2547    kCycles for 100 * xlat, new dest
1672    kCycles for 100 * table, source zero-delimited
2977    kCycles for 100 * table, source zero-delimited, stosb

3485    kCycles for 100 * cmp al, new dest
1719    kCycles for 100 * table, new dest
2883    kCycles for 100 * cmp al, in place
2515    kCycles for 100 * xlat, new dest
1635    kCycles for 100 * table, source zero-delimited
2971    kCycles for 100 * table, source zero-delimited, stosb

3440    kCycles for 100 * cmp al, new dest
1681    kCycles for 100 * table, new dest
2804    kCycles for 100 * cmp al, in place
2502    kCycles for 100 * xlat, new dest
1652    kCycles for 100 * table, source zero-delimited
2938    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
May the source be with you

six_L

Intel(R) Core(TM) i5-9400H CPU @ 2.50GHz (SSE4)

2941    kCycles for 100 * cmp al, new dest
999     kCycles for 100 * table, new dest
1898    kCycles for 100 * cmp al, in place
1177    kCycles for 100 * xlat, new dest
1183    kCycles for 100 * table, source zero-delimited
1257    kCycles for 100 * table, source zero-delimited, stosb

2994    kCycles for 100 * cmp al, new dest
890     kCycles for 100 * table, new dest
1902    kCycles for 100 * cmp al, in place
1161    kCycles for 100 * xlat, new dest
1171    kCycles for 100 * table, source zero-delimited
1187    kCycles for 100 * table, source zero-delimited, stosb

2946    kCycles for 100 * cmp al, new dest
909     kCycles for 100 * table, new dest
1923    kCycles for 100 * cmp al, in place
1236    kCycles for 100 * xlat, new dest
1352    kCycles for 100 * table, source zero-delimited
1157    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb


--- ok ---
Say you, Say me, Say the codes together for ever.

LiaoMi

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
-19 of 20 tests valid, loop overhead is approx. 78/100 cycles

1397    kCycles for 100 * cmp al, new dest
662     kCycles for 100 * table, new dest
1133    kCycles for 100 * cmp al, in place
966     kCycles for 100 * xlat, new dest
850     kCycles for 100 * table, source zero-delimited
907     kCycles for 100 * table, source zero-delimited, stosb

1348    kCycles for 100 * cmp al, new dest
682     kCycles for 100 * table, new dest
1088    kCycles for 100 * cmp al, in place
924     kCycles for 100 * xlat, new dest
864     kCycles for 100 * table, source zero-delimited
884     kCycles for 100 * table, source zero-delimited, stosb

1389    kCycles for 100 * cmp al, new dest
681     kCycles for 100 * table, new dest
1069    kCycles for 100 * cmp al, in place
936     kCycles for 100 * xlat, new dest
852     kCycles for 100 * table, source zero-delimited
861     kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb


--- ok ---


11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

1354    kCycles for 100 * cmp al, new dest
652     kCycles for 100 * table, new dest
1084    kCycles for 100 * cmp al, in place
945     kCycles for 100 * xlat, new dest
843     kCycles for 100 * table, source zero-delimited
879     kCycles for 100 * table, source zero-delimited, stosb

1402    kCycles for 100 * cmp al, new dest
671     kCycles for 100 * table, new dest
1110    kCycles for 100 * cmp al, in place
974     kCycles for 100 * xlat, new dest
916     kCycles for 100 * table, source zero-delimited
939     kCycles for 100 * table, source zero-delimited, stosb

1357    kCycles for 100 * cmp al, new dest
648     kCycles for 100 * table, new dest
1239    kCycles for 100 * cmp al, in place
995     kCycles for 100 * xlat, new dest
871     kCycles for 100 * table, source zero-delimited
871     kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb

Biterider


Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (SSE4)

4736    kCycles for 100 * cmp al, new dest
1600    kCycles for 100 * table, new dest
2282    kCycles for 100 * cmp al, in place
2394    kCycles for 100 * xlat, new dest
1705    kCycles for 100 * table, source zero-delimited
2012    kCycles for 100 * table, source zero-delimited, stosb

4969    kCycles for 100 * cmp al, new dest
1782    kCycles for 100 * table, new dest
2606    kCycles for 100 * cmp al, in place
2197    kCycles for 100 * xlat, new dest
1927    kCycles for 100 * table, source zero-delimited
2170    kCycles for 100 * table, source zero-delimited, stosb

4848    kCycles for 100 * cmp al, new dest
3238    kCycles for 100 * table, new dest
2276    kCycles for 100 * cmp al, in place
1797    kCycles for 100 * xlat, new dest
1692    kCycles for 100 * table, source zero-delimited
1916    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb


-


Biterider

hutch--

I just wonder how many terabytes you would need to see the difference.  :tongue:

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

3931    kCycles for 100 * cmp al, new dest
1224    kCycles for 100 * table, new dest
1869    kCycles for 100 * cmp al, in place
2007    kCycles for 100 * xlat, new dest
5581    kCycles for 100 * LevelUp, in place

4133    kCycles for 100 * cmp al, new dest
1225    kCycles for 100 * table, new dest
2107    kCycles for 100 * cmp al, in place
1735    kCycles for 100 * xlat, new dest
5706    kCycles for 100 * LevelUp, in place

4605    kCycles for 100 * cmp al, new dest
1605    kCycles for 100 * table, new dest
1844    kCycles for 100 * cmp al, in place
1555    kCycles for 100 * xlat, new dest
5506    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place

NoCforMe

Pardon me as I intrude here, as an amateur, but let me get this straight: we're talking about ToUpper() which turns any ASCII alpha characters into uppercase? Is that right?

If so, the method being used seems waaay too complex. Here's mine, which I've been using for decades now:

;============================================
; ToUpper()
;
; If the character in AL is a lowercase ASCII letter, uppercases it.
;============================================

ToUpper PROC
    CMP AL, 'a'
    JB  tu99            ; below 'a': leave unchanged
    CMP AL, 'z'
    JA  tu99            ; above 'z': leave unchanged
    AND AL, 5FH         ; clear bit 5 (20h): 'a'..'z' -> 'A'..'Z'
tu99:
    RET
ToUpper ENDP


Just twiddle a couple bits is all. (ToLower() would operate similarly, by setting rather than clearing bits.)
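
As a rough C sketch of the same bit trick, including the ToLower() counterpart mentioned above: the function names to_upper_ascii and to_lower_ascii are made up for illustration, and the code assumes plain ASCII, where upper and lower case letters differ only in bit 5 (20h).

```c
#include <assert.h>

/* The trick only works if the two cases are exactly 20h apart, as in ASCII. */
static_assert('a' - 'A' == 0x20, "ASCII letter layout assumed");

/* Mirrors CMP/JB/CMP/JA followed by AND AL, 5FH. */
unsigned char to_upper_ascii(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        c = (unsigned char)(c & 0x5F);   /* clear bit 5: 'a' (61h) -> 'A' (41h) */
    return c;
}

/* The counterpart sets bit 5 instead of clearing it. */
unsigned char to_lower_ascii(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        c = (unsigned char)(c | 0x20);   /* set bit 5: 'A' (41h) -> 'a' (61h) */
    return c;
}
```

The range check matters: without it, the mask would also change characters such as '{' (7Bh), which would become '[' (5Bh).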

Any objections? or am I missing something obvious here?
Assembly language programming should be fun. That's why I do it.

jj2007

NoCforMe,

Your routine works for a single char, while the algos in the testbed translate a whole string to uppercase. The first one called "cmp al, new dest" does exactly what yours is doing. However, this one (using a table) is over twice as fast, and very short:

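    mov edx, offset Src                     ; edx -> zero-delimited source string
    mov edi, offset Dest                    ; edi -> destination buffer
    xor eax, eax                            ; eax = byte index
    align 4
    .While 1
        movzx ecx, byte ptr [edx+eax]       ; load the next source byte
        jecxz @out                          ; stop at the terminating zero
        movzx ecx, byte ptr TheTable[ecx]   ; look up the uppercase equivalent
        mov [edi+eax], cl                   ; store it in the destination
        inc eax
    .Endw
@out:                                       ; the terminating zero itself is not copied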


As Hutch writes above, there is rarely a need for so much speed. We are doing this just for fun :cool:
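
A hedged C rendering of the zero-delimited table loop, for comparison. The function name translate_z is hypothetical, and the table is passed in as a parameter rather than referenced as a global like TheTable. Like the MASM loop, the sketch stops at the terminating zero without copying it.

```c
#include <assert.h>
#include <stddef.h>

/* Translate bytes from src to dest through a 256-entry table until the
   terminating zero is reached. Returns the number of bytes written.
   The zero itself is not written, so the caller terminates dest if needed. */
size_t translate_z(unsigned char *dest, const unsigned char *src,
                   const unsigned char table[256])
{
    assert(dest != NULL && src != NULL && table != NULL);
    size_t i = 0;
    while (src[i] != 0) {            /* corresponds to jecxz @out */
        dest[i] = table[src[i]];     /* table lookup, then store */
        i++;
    }
    return i;
}
```

Because the loop needs no length up front, it saves a separate pass to measure the string. If dest must be a valid C string afterwards, write dest[n] = 0 using the returned count n.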

Vortex

Hi Jochen,

Thanks, I liked your code creating the lookup table :thumbsup: Better and more elegant than my handcrafted tables.

hutch--

JJ,

If you have the time, clock these two out of the library.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLower proc text:DWORD

  ; -----------------------------
  ; converts string to lower case
  ; invoke szLower,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax

  @@:
    add eax, 1
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "A"
    jb @B
    cmp BYTE PTR [eax], "Z"
    ja @B
    add BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szLower endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤



; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szUpper proc text:DWORD

  ; -----------------------------
  ; converts string to upper case
  ; invoke szUpper,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax

  @@:
    add eax, 1
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "a"
    jb @B
    cmp BYTE PTR [eax], "z"
    ja @B
    sub BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szUpper endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
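
For comparison with the table versions, here is a minimal C sketch of what the two library routines above do: walk a zero-terminated string in place and adjust by 32 only when the byte falls inside the letter range. The names sz_upper and sz_lower are illustrative, not part of any library, and ASCII is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Uppercase a zero-terminated string in place. Returns its argument,
   the way the routine above reloads the string address into EAX. */
char *sz_upper(char *text)
{
    assert(text != NULL);
    for (char *p = text; *p != '\0'; p++) {
        if (*p >= 'a' && *p <= 'z')
            *p = (char)(*p - 32);    /* sub BYTE PTR [eax], 32 */
    }
    return text;
}

/* Lowercase a zero-terminated string in place; same shape, opposite adjustment. */
char *sz_lower(char *text)
{
    assert(text != NULL);
    for (char *p = text; *p != '\0'; p++) {
        if (*p >= 'A' && *p <= 'Z')
            *p = (char)(*p + 32);    /* add BYTE PTR [eax], 32 */
    }
    return text;
}
```

Unlike the table lookup, this version takes two compares and up to two conditional branches per character.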