The MASM Forum

General => The Laboratory => Topic started by: jj2007 on January 11, 2022, 01:06:27 PM

Title: ToUpper & ToLower timings
Post by: jj2007 on January 11, 2022, 01:06:27 PM
This collection is almost 8 years old, but apparently it has never been tested in the Lab :cool:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+++17 of 20 tests valid, loop overhead is approx. 644/100 cycles

4074    kCycles for 100 * cmp al, new dest
1623    kCycles for 100 * table, new dest
3063    kCycles for 100 * cmp al, in place
1669    kCycles for 100 * xlat, new dest
5571    kCycles for 100 * LevelUp, in place

4528    kCycles for 100 * cmp al, new dest
1618    kCycles for 100 * table, new dest
3631    kCycles for 100 * cmp al, in place
2802    kCycles for 100 * xlat, new dest
5553    kCycles for 100 * LevelUp, in place

4569    kCycles for 100 * cmp al, new dest
1633    kCycles for 100 * table, new dest
3076    kCycles for 100 * cmp al, in place
1669    kCycles for 100 * xlat, new dest
5569    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


See here for similar stuff at the FreeBasic forum (https://www.freebasic.net/forum/viewtopic.php?f=3&t=31391&p=288920&sid=487127adb17ebcbd08c5e7493258153e#p288917)
Title: Re: ToUpper & ToLower timings
Post by: LiaoMi on January 11, 2022, 05:33:57 PM
Hi  :tongue:,

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

1378    kCycles for 100 * cmp al, new dest
660     kCycles for 100 * table, new dest
1142    kCycles for 100 * cmp al, in place
960     kCycles for 100 * xlat, new dest
1395    kCycles for 100 * LevelUp, in place

1425    kCycles for 100 * cmp al, new dest
674     kCycles for 100 * table, new dest
1185    kCycles for 100 * cmp al, in place
1006    kCycles for 100 * xlat, new dest
1425    kCycles for 100 * LevelUp, in place

1365    kCycles for 100 * cmp al, new dest
671     kCycles for 100 * table, new dest
1159    kCycles for 100 * cmp al, in place
1012    kCycles for 100 * xlat, new dest
1472    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


--- ok ---
Title: Re: ToUpper & ToLower timings
Post by: daydreamer on January 11, 2022, 11:22:03 PM
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
++++++14 of 20 tests valid, loop overhead is approx. 323/100 cycles

4207    kCycles for 100 * cmp al, new dest
1550    kCycles for 100 * table, new dest
1980    kCycles for 100 * cmp al, in place
1785    kCycles for 100 * xlat, new dest
3248    kCycles for 100 * LevelUp, in place

5186    kCycles for 100 * cmp al, new dest
1568    kCycles for 100 * table, new dest
2188    kCycles for 100 * cmp al, in place
1821    kCycles for 100 * xlat, new dest
3231    kCycles for 100 * LevelUp, in place

5047    kCycles for 100 * cmp al, new dest
1667    kCycles for 100 * table, new dest
1964    kCycles for 100 * cmp al, in place
1777    kCycles for 100 * xlat, new dest
3310    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


Title: Re: ToUpper & ToLower timings
Post by: TimoVJL on January 12, 2022, 12:03:47 AM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
+19 of 20 tests valid, loop overhead is approx. 322/100 cycles

3438    kCycles for 100 * cmp al, new dest
1883    kCycles for 100 * table, new dest
3641    kCycles for 100 * cmp al, in place
2602    kCycles for 100 * xlat, new dest
2967    kCycles for 100 * LevelUp, in place

3917    kCycles for 100 * cmp al, new dest
1882    kCycles for 100 * table, new dest
3654    kCycles for 100 * cmp al, in place
2513    kCycles for 100 * xlat, new dest
2975    kCycles for 100 * LevelUp, in place

3839    kCycles for 100 * cmp al, new dest
1879    kCycles for 100 * table, new dest
3644    kCycles for 100 * cmp al, in place
2618    kCycles for 100 * xlat, new dest
2903    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place
Title: Re: ToUpper & ToLower timings
Post by: coaster on January 12, 2022, 12:35:09 AM
Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz (SSE4)
+++++15 of 20 tests valid, loop overhead is approx. 402/100 cycles

3153    kCycles for 100 * cmp al, new dest
1174    kCycles for 100 * table, new dest
1435    kCycles for 100 * cmp al, in place
1584    kCycles for 100 * xlat, new dest
3355    kCycles for 100 * LevelUp, in place

3770    kCycles for 100 * cmp al, new dest
1191    kCycles for 100 * table, new dest
1510    kCycles for 100 * cmp al, in place
1737    kCycles for 100 * xlat, new dest
4648    kCycles for 100 * LevelUp, in place

3725    kCycles for 100 * cmp al, new dest
1334    kCycles for 100 * table, new dest
1969    kCycles for 100 * cmp al, in place
2227    kCycles for 100 * xlat, new dest
4367    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place

Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 12, 2022, 12:48:01 AM
And the winner is...

xor ecx, ecx
align 4
.Repeat
	movzx eax, byte ptr [edx+ecx]		; edx is source string
	movzx eax, byte ptr TheTable[eax]	; mov al is equally fast on i5
	mov [edi+ecx], al			; edi is destination string
	inc ecx
.Until ecx>=SrcBytes
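
For readers who do not read MASM fluently, the winning loop can be sketched in C as follows. This is only an illustration, not the testbed code: the names upper_table, init_upper_table and upper_copy are made up here, and the table fill is an assumption about what TheTable contains (identity for every byte except 'a'..'z').

```c
#include <stddef.h>

/* Hypothetical stand-in for TheTable: one output byte per possible input byte. */
unsigned char upper_table[256];

/* Fill the table once: every byte maps to itself, except 'a'..'z',
   which map to 'A'..'Z' (ASCII only, an assumption of this sketch). */
void init_upper_table(void)
{
    for (int i = 0; i < 256; i++) {
        upper_table[i] = (unsigned char)((i >= 'a' && i <= 'z') ? i - 32 : i);
    }
}

/* Counted copy into a new destination: one load, one table lookup and
   one store per byte, with no data-dependent branch inside the loop. */
void upper_copy(unsigned char *dest, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dest[i] = upper_table[src[i]];
    }
}
```

The only branch left is the loop counter itself, which would explain why the table variants are less sensitive to the mix of upper and lower case in the source than the cmp variants.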


I've added two more table-based solutions, both fast and 28 bytes short:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
++++16 of 20 tests valid, loop overhead is approx. 600/100 cycles

4062    kCycles for 100 * cmp al, new dest
1622    kCycles for 100 * table, new dest
3075    kCycles for 100 * cmp al, in place
1668    kCycles for 100 * xlat, new dest
1638    kCycles for 100 * table, source zero-delimited
1634    kCycles for 100 * table, source zero-delimited, stosb

4057    kCycles for 100 * cmp al, new dest
1617    kCycles for 100 * table, new dest
3058    kCycles for 100 * cmp al, in place
1666    kCycles for 100 * xlat, new dest
1620    kCycles for 100 * table, source zero-delimited
1623    kCycles for 100 * table, source zero-delimited, stosb

4089    kCycles for 100 * cmp al, new dest
1607    kCycles for 100 * table, new dest
3191    kCycles for 100 * cmp al, in place
1702    kCycles for 100 * xlat, new dest
1621    kCycles for 100 * table, source zero-delimited
1626    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
Title: Re: ToUpper & ToLower timings
Post by: TimoVJL on January 12, 2022, 01:36:18 AM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

3494    kCycles for 100 * cmp al, new dest
1657    kCycles for 100 * table, new dest
2839    kCycles for 100 * cmp al, in place
2547    kCycles for 100 * xlat, new dest
1672    kCycles for 100 * table, source zero-delimited
2977    kCycles for 100 * table, source zero-delimited, stosb

3485    kCycles for 100 * cmp al, new dest
1719    kCycles for 100 * table, new dest
2883    kCycles for 100 * cmp al, in place
2515    kCycles for 100 * xlat, new dest
1635    kCycles for 100 * table, source zero-delimited
2971    kCycles for 100 * table, source zero-delimited, stosb

3440    kCycles for 100 * cmp al, new dest
1681    kCycles for 100 * table, new dest
2804    kCycles for 100 * cmp al, in place
2502    kCycles for 100 * xlat, new dest
1652    kCycles for 100 * table, source zero-delimited
2938    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
Title: Re: ToUpper & ToLower timings
Post by: six_L on January 12, 2022, 02:36:47 AM
Intel(R) Core(TM) i5-9400H CPU @ 2.50GHz (SSE4)

2941    kCycles for 100 * cmp al, new dest
999     kCycles for 100 * table, new dest
1898    kCycles for 100 * cmp al, in place
1177    kCycles for 100 * xlat, new dest
1183    kCycles for 100 * table, source zero-delimited
1257    kCycles for 100 * table, source zero-delimited, stosb

2994    kCycles for 100 * cmp al, new dest
890     kCycles for 100 * table, new dest
1902    kCycles for 100 * cmp al, in place
1161    kCycles for 100 * xlat, new dest
1171    kCycles for 100 * table, source zero-delimited
1187    kCycles for 100 * table, source zero-delimited, stosb

2946    kCycles for 100 * cmp al, new dest
909     kCycles for 100 * table, new dest
1923    kCycles for 100 * cmp al, in place
1236    kCycles for 100 * xlat, new dest
1352    kCycles for 100 * table, source zero-delimited
1157    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb


--- ok ---
Title: Re: ToUpper & ToLower timings
Post by: LiaoMi on January 12, 2022, 02:37:28 AM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
-19 of 20 tests valid, loop overhead is approx. 78/100 cycles

1397    kCycles for 100 * cmp al, new dest
662     kCycles for 100 * table, new dest
1133    kCycles for 100 * cmp al, in place
966     kCycles for 100 * xlat, new dest
850     kCycles for 100 * table, source zero-delimited
907     kCycles for 100 * table, source zero-delimited, stosb

1348    kCycles for 100 * cmp al, new dest
682     kCycles for 100 * table, new dest
1088    kCycles for 100 * cmp al, in place
924     kCycles for 100 * xlat, new dest
864     kCycles for 100 * table, source zero-delimited
884     kCycles for 100 * table, source zero-delimited, stosb

1389    kCycles for 100 * cmp al, new dest
681     kCycles for 100 * table, new dest
1069    kCycles for 100 * cmp al, in place
936     kCycles for 100 * xlat, new dest
852     kCycles for 100 * table, source zero-delimited
861     kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb


--- ok ---


11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

1354    kCycles for 100 * cmp al, new dest
652     kCycles for 100 * table, new dest
1084    kCycles for 100 * cmp al, in place
945     kCycles for 100 * xlat, new dest
843     kCycles for 100 * table, source zero-delimited
879     kCycles for 100 * table, source zero-delimited, stosb

1402    kCycles for 100 * cmp al, new dest
671     kCycles for 100 * table, new dest
1110    kCycles for 100 * cmp al, in place
974     kCycles for 100 * xlat, new dest
916     kCycles for 100 * table, source zero-delimited
939     kCycles for 100 * table, source zero-delimited, stosb

1357    kCycles for 100 * cmp al, new dest
648     kCycles for 100 * table, new dest
1239    kCycles for 100 * cmp al, in place
995     kCycles for 100 * xlat, new dest
871     kCycles for 100 * table, source zero-delimited
871     kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
Title: Re: ToUpper & ToLower timings
Post by: Biterider on January 12, 2022, 07:54:16 AM

Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (SSE4)

4736    kCycles for 100 * cmp al, new dest
1600    kCycles for 100 * table, new dest
2282    kCycles for 100 * cmp al, in place
2394    kCycles for 100 * xlat, new dest
1705    kCycles for 100 * table, source zero-delimited
2012    kCycles for 100 * table, source zero-delimited, stosb

4969    kCycles for 100 * cmp al, new dest
1782    kCycles for 100 * table, new dest
2606    kCycles for 100 * cmp al, in place
2197    kCycles for 100 * xlat, new dest
1927    kCycles for 100 * table, source zero-delimited
2170    kCycles for 100 * table, source zero-delimited, stosb

4848    kCycles for 100 * cmp al, new dest
3238    kCycles for 100 * table, new dest
2276    kCycles for 100 * cmp al, in place
1797    kCycles for 100 * xlat, new dest
1692    kCycles for 100 * table, source zero-delimited
1916    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb




Biterider
Title: Re: ToUpper & ToLower timings
Post by: hutch-- on January 12, 2022, 11:04:35 AM
I just wonder how many terabytes you would need to see the difference.  :tongue:

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

3931    kCycles for 100 * cmp al, new dest
1224    kCycles for 100 * table, new dest
1869    kCycles for 100 * cmp al, in place
2007    kCycles for 100 * xlat, new dest
5581    kCycles for 100 * LevelUp, in place

4133    kCycles for 100 * cmp al, new dest
1225    kCycles for 100 * table, new dest
2107    kCycles for 100 * cmp al, in place
1735    kCycles for 100 * xlat, new dest
5706    kCycles for 100 * LevelUp, in place

4605    kCycles for 100 * cmp al, new dest
1605    kCycles for 100 * table, new dest
1844    kCycles for 100 * cmp al, in place
1555    kCycles for 100 * xlat, new dest
5506    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place
Title: Re: ToUpper & ToLower timings
Post by: NoCforMe on January 12, 2022, 11:09:51 AM
Pardon me as I intrude here, as an amateur, but let me get this straight: we're talking about ToUpper() which turns any ASCII alpha characters into uppercase? Is that right?

If so, the method being used seems waaay too complex. Here's mine, which I've been using for decades now:

;============================================
; ToUpper()
;
; If character in AL is alphabetic, uppercases it.
;============================================

ToUpper PROC
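	CMP	AL, 'a'
	JB	tu99		; below 'a': not a lowercase letter
	CMP	AL, 'z'
	JA	tu99		; above 'z': not a lowercase letter
	AND	AL, 5FH		; clear bit 5 (20h): 'a'..'z' becomes 'A'..'Z'
tu99:	RET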

ToUpper ENDP


Just twiddle a couple bits is all. (ToLower() would operate similarly, by setting rather than clearing bits.)

Any objections? or am I missing something obvious here?
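
The bit twiddling described in this post can be written out for both directions. A small C sketch, assuming plain ASCII input; the function names to_upper_ascii and to_lower_ascii are invented for the illustration:

```c
/* Uppercase one ASCII character by clearing bit 5 (0x20).
   0x5F is the same mask as AND AL, 5FH in the listing above; it also
   clears bit 7, which is already 0 for every byte in 'a'..'z'. */
unsigned char to_upper_ascii(unsigned char c)
{
    if (c >= 'a' && c <= 'z') {
        c &= 0x5F;
    }
    return c;
}

/* Lowercase one ASCII character by setting bit 5 instead of clearing it. */
unsigned char to_lower_ascii(unsigned char c)
{
    if (c >= 'A' && c <= 'Z') {
        c |= 0x20;
    }
    return c;
}
```

The range check is what keeps characters such as '{' or '@' untouched; the mask alone would change them.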
Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 12, 2022, 11:30:39 AM
NoCforMe,

Your routine handles a single char, while the algos in the testbed translate a whole string to uppercase. The first one, called "cmp al, new dest", does exactly what yours does, once per character. However, this one (using a table) is over twice as fast, and very short:

mov edx, offset Src
mov edi, offset Dest
xor eax, eax
align 4
.While 1
	movzx ecx, byte ptr [edx+eax]
	jecxz @out
	movzx ecx, byte ptr TheTable[ecx]
	mov [edi+eax], cl
	inc eax
.Endw
@out:


As Hutch writes above, there is rarely a need for so much speed. We are doing this just for fun :cool:
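
A rough C equivalent of the zero-delimited loop above, for comparison. It is a sketch under assumptions: the function name upper_copy_z is made up, and the 256-byte map is passed in by the caller rather than being a global like TheTable. Like the assembly, it leaves the loop at the zero byte without storing it, so the destination has to be terminated some other way.

```c
#include <stddef.h>

/* Translate a zero-terminated source through a 256-byte map into dest.
   Returns the number of bytes written. The terminator is NOT copied,
   mirroring the jecxz exit in the listing above. */
size_t upper_copy_z(unsigned char *dest, const unsigned char *src,
                    const unsigned char map[256])
{
    size_t i = 0;
    for (;;) {
        unsigned char c = src[i];
        if (c == 0) {
            break;              /* jecxz @out */
        }
        dest[i] = map[c];       /* table lookup replaces both range compares */
        i++;
    }
    return i;
}
```

No length has to be known in advance here; the price is one extra test per byte for the terminator.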
Title: Re: ToUpper & ToLower timings
Post by: Vortex on January 13, 2022, 04:48:33 AM
Hi Jochen,

Thanks, I liked your code creating the lookup table :thumbsup: Better and more elegant than my handcrafted tables.
Title: Re: ToUpper & ToLower timings
Post by: hutch-- on January 13, 2022, 04:55:10 AM
JJ,

If you have the time, clock these two out of the library.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLower proc text:DWORD

  ; -----------------------------
  ; converts string to lower case
  ; invoke szLower,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax

  @@:
    add eax, 1
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "A"
    jb @B
    cmp BYTE PTR [eax], "Z"
    ja @B
    add BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szLower endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤



; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szUpper proc text:DWORD

  ; -----------------------------
  ; converts string to upper case
  ; invoke szUpper,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax

  @@:
    add eax, 1
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "a"
    jb @B
    cmp BYTE PTR [eax], "z"
    ja @B
    sub BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szUpper endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 13, 2022, 06:24:14 AM
Quote from: Vortex on January 13, 2022, 04:48:33 AM
Hi Jochen,

Thanks, I liked your code creating the lookup table :thumbsup: Better and more elegant than my handcrafted tables.

Hi Erol,

It's called "cheating", but it fits the purpose ;-)
Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 13, 2022, 06:37:28 AM
Quote from: hutch-- on January 13, 2022, 04:55:10 AM
JJ,

If you have the time, clock these two out of the library.

I have a suspicion that szUpper profits from the fact that it converts "in place". So in run #2 it's already all uppercase. But why is szLower so slow then? Mysteries :sad:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 631/100 cycles

4066    kCycles for 100 * cmp al, new dest
1606    kCycles for 100 * table, new dest
3050    kCycles for 100 * cmp al, in place
1659    kCycles for 100 * xlat, new dest
1614    kCycles for 100 * table, source zero-delimited
1612    kCycles for 100 * table, source zero-delimited, stosb
6310    kCycles for 100 * szLower
1610    kCycles for 100 * szUpper

2198    kCycles for 100 * cmp al, new dest
1626    kCycles for 100 * table, new dest
3139    kCycles for 100 * cmp al, in place
1657    kCycles for 100 * xlat, new dest
1610    kCycles for 100 * table, source zero-delimited
1631    kCycles for 100 * table, source zero-delimited, stosb
6308    kCycles for 100 * szLower
1615    kCycles for 100 * szUpper

2192    kCycles for 100 * cmp al, new dest
1613    kCycles for 100 * table, new dest
3049    kCycles for 100 * cmp al, in place
1667    kCycles for 100 * xlat, new dest
1613    kCycles for 100 * table, source zero-delimited
1617    kCycles for 100 * table, source zero-delimited, stosb
6359    kCycles for 100 * szLower
1612    kCycles for 100 * szUpper

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower
7       bytes for szUpper
Title: Re: ToUpper & ToLower timings
Post by: nidud on January 13, 2022, 07:23:05 AM
deleted
Title: Re: ToUpper & ToLower timings
Post by: hutch-- on January 13, 2022, 09:33:43 AM
I have no idea why they differ so much.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

3954    kCycles for 100 * cmp al, new dest
1281    kCycles for 100 * table, new dest
1896    kCycles for 100 * cmp al, in place
1603    kCycles for 100 * xlat, new dest
1490    kCycles for 100 * table, source zero-delimited
1693    kCycles for 100 * table, source zero-delimited, stosb
6283    kCycles for 100 * szLower
1541    kCycles for 100 * szUpper

1989    kCycles for 100 * cmp al, new dest
1304    kCycles for 100 * table, new dest
1897    kCycles for 100 * cmp al, in place
1603    kCycles for 100 * xlat, new dest
1482    kCycles for 100 * table, source zero-delimited
1695    kCycles for 100 * table, source zero-delimited, stosb
6282    kCycles for 100 * szLower
1618    kCycles for 100 * szUpper

1990    kCycles for 100 * cmp al, new dest
1280    kCycles for 100 * table, new dest
1898    kCycles for 100 * cmp al, in place
1608    kCycles for 100 * xlat, new dest
1479    kCycles for 100 * table, source zero-delimited
1693    kCycles for 100 * table, source zero-delimited, stosb
6295    kCycles for 100 * szLower
1618    kCycles for 100 * szUpper

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower
7       bytes for szUpper
Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 13, 2022, 11:03:17 AM
Quote from: hutch-- on January 13, 2022, 09:33:43 AM
I have no idea why they differ so much.

It's the "in place" thing:
szLower:
    cmp BYTE PTR [eax], 0   ; after the first iteration, the string is already lowercase
    je @F
    cmp BYTE PTR [eax], "A"   ; a lowercase letter is not below "A", so jb falls through
    jb @B
    cmp BYTE PTR [eax], "Z"   ; it is above "Z", so ja is taken, but only after two compares
    ja @B

szUpper:
    cmp BYTE PTR [eax], 0   ; after the first iteration, the string is already uppercase
    je @F
    cmp BYTE PTR [eax], "a"   ; an uppercase letter is below "a", so jb is taken at once
    jb @B
    cmp BYTE PTR [eax], "z"
    ja @B

Now this is weird... a big improvement for szLower, partly because I swapped the order of the "A" and "Z" comparisons, but mostly because of a single instruction change (inc eax instead of add eax, 1):

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

4076    kCycles for 100 * cmp al, new dest
1614    kCycles for 100 * table, new dest
3152    kCycles for 100 * cmp al, in place
1657    kCycles for 100 * xlat, new dest
1613    kCycles for 100 * table, source zero-delimited
1617    kCycles for 100 * table, source zero-delimited, stosb
1827    kCycles for 100 * szLower2
1635    kCycles for 100 * szUpper2

2189    kCycles for 100 * cmp al, new dest
1613    kCycles for 100 * table, new dest
3143    kCycles for 100 * cmp al, in place
1688    kCycles for 100 * xlat, new dest
1623    kCycles for 100 * table, source zero-delimited
1622    kCycles for 100 * table, source zero-delimited, stosb
1806    kCycles for 100 * szLower2
1616    kCycles for 100 * szUpper2


Big improvement for szLower, right?
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLower2 proc text:DWORD

  ; -----------------------------
  ; converts string to lower case
  ; invoke szLower2,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax
; align 2 ; much, much slower on my i5
  @@:
  if 1
    inc eax ; much, much faster than add eax, 1
  else
    add eax, 1
  endif
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "Z"
    ja @B
    cmp BYTE PTR [eax], "A"
    jb @B
    or BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szLower2 endp


Btw the source assembles without MasmBasic, it's plain Masm32 SDK :cool:
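
The effect of swapping the "A" and "Z" tests can be checked by simply counting range compares per call. The C sketch below models only that part (not the inc versus add difference, and not the zero test); the function names are invented for the illustration.

```c
#include <stddef.h>

/* Range compares needed with the "A" test first, as in szLower. */
size_t compares_a_first(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        n++;                      /* cmp BYTE PTR [eax], "A" */
        if (c < 'A') continue;    /* jb @B */
        n++;                      /* cmp BYTE PTR [eax], "Z" */
        if (c > 'Z') continue;    /* ja @B */
    }
    return n;
}

/* Range compares needed with the "Z" test first, as in szLower2. */
size_t compares_z_first(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        n++;                      /* cmp BYTE PTR [eax], "Z" */
        if (c > 'Z') continue;    /* ja @B */
        n++;                      /* cmp BYTE PTR [eax], "A" */
        if (c < 'A') continue;    /* jb @B */
    }
    return n;
}
```

On an already lowercased string such as "hello", the first order needs 10 compares and the second only 5. The gain is not free: for digits and spaces, which sit below "A", the swapped order costs two compares instead of one.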
Title: Re: ToUpper & ToLower timings
Post by: LiaoMi on January 14, 2022, 05:13:03 AM
Quote from: jj2007 on January 13, 2022, 11:03:17 AM

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

1441    kCycles for 100 * cmp al, new dest
705     kCycles for 100 * table, new dest
1147    kCycles for 100 * cmp al, in place
981     kCycles for 100 * xlat, new dest
887     kCycles for 100 * table, source zero-delimited
914     kCycles for 100 * table, source zero-delimited, stosb
1088    kCycles for 100 * szLower2
952     kCycles for 100 * szUpper2

938     kCycles for 100 * cmp al, new dest
706     kCycles for 100 * table, new dest
1142    kCycles for 100 * cmp al, in place
974     kCycles for 100 * xlat, new dest
963     kCycles for 100 * table, source zero-delimited
914     kCycles for 100 * table, source zero-delimited, stosb
1107    kCycles for 100 * szLower2
974     kCycles for 100 * szUpper2

1013    kCycles for 100 * cmp al, new dest
702     kCycles for 100 * table, new dest
1162    kCycles for 100 * cmp al, in place
978     kCycles for 100 * xlat, new dest
893     kCycles for 100 * table, source zero-delimited
956     kCycles for 100 * table, source zero-delimited, stosb
1124    kCycles for 100 * szLower2
953     kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2


--- ok ---
Title: Re: ToUpper & ToLower timings
Post by: LiaoMi on January 14, 2022, 05:14:40 AM
Quote from: nidud on January 13, 2022, 07:23:05 AM
They will probably be similar with a small advantage for the table.

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (AVX512)
----------------------------------------------
-- test(1)
    29771 cycles, rep(1000), code( 29) 0.asm: cmp
    38149 cycles, rep(1000), code(288) 1.asm: table
    28763 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(2)
    29992 cycles, rep(1000), code( 29) 0.asm: cmp
    40331 cycles, rep(1000), code(288) 1.asm: table
    25967 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(3)
    29737 cycles, rep(1000), code( 29) 0.asm: cmp
    40168 cycles, rep(1000), code(288) 1.asm: table
    29103 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(4)
    30073 cycles, rep(1000), code( 29) 0.asm: cmp
    41148 cycles, rep(1000), code(288) 1.asm: table
    28872 cycles, rep(1000), code(304) 2.asm: cmp+table

total [1 .. 4], 1++
   112705 cycles 2.asm: cmp+table
   119573 cycles 0.asm: cmp
   159796 cycles 1.asm: table
Title: Re: ToUpper & ToLower timings
Post by: TimoVJL on January 14, 2022, 05:20:11 AM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

3243    kCycles for 100 * cmp al, new dest
1348    kCycles for 100 * table, new dest
2541    kCycles for 100 * cmp al, in place
2202    kCycles for 100 * xlat, new dest
1419    kCycles for 100 * table, source zero-delimited
2641    kCycles for 100 * table, source zero-delimited, stosb
1782    kCycles for 100 * szLower2
1015    kCycles for 100 * szUpper2

1963    kCycles for 100 * cmp al, new dest
1362    kCycles for 100 * table, new dest
2541    kCycles for 100 * cmp al, in place
2201    kCycles for 100 * xlat, new dest
1352    kCycles for 100 * table, source zero-delimited
2632    kCycles for 100 * table, source zero-delimited, stosb
1782    kCycles for 100 * szLower2
981     kCycles for 100 * szUpper2

1983    kCycles for 100 * cmp al, new dest
1344    kCycles for 100 * table, new dest
2517    kCycles for 100 * cmp al, in place
2196    kCycles for 100 * xlat, new dest
1407    kCycles for 100 * table, source zero-delimited
2636    kCycles for 100 * table, source zero-delimited, stosb
1745    kCycles for 100 * szLower2
989     kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2
AMD Ryzen 5 3400G with Radeon Vega Graphics     (AVX2)
----------------------------------------------
-- test(1)
    77637 cycles, rep(1000), code( 29) 0.asm: cmp
    58463 cycles, rep(1000), code(288) 1.asm: table
    85323 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(2)
    90772 cycles, rep(1000), code( 29) 0.asm: cmp
    68947 cycles, rep(1000), code(288) 1.asm: table
    99731 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(3)
    95322 cycles, rep(1000), code( 29) 0.asm: cmp
    67971 cycles, rep(1000), code(288) 1.asm: table
    95539 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(4)
    98111 cycles, rep(1000), code( 29) 0.asm: cmp
    73314 cycles, rep(1000), code(288) 1.asm: table
    92986 cycles, rep(1000), code(304) 2.asm: cmp+table

total [1 .. 4], 1++
   268695 cycles 1.asm: table
   361842 cycles 0.asm: cmp
   373579 cycles 2.asm: cmp+table
Title: Re: ToUpper & ToLower timings
Post by: Vortex on January 14, 2022, 06:03:05 AM
Hi Nidud,

Your lookup table function can be made faster by eliminating the first jump coming after test edx,edx :

include \masm32\include64\masm64rt.inc

    .data

table_up label sbyte

i = 0

while i lt 256

    if (i ge 'a') and (i le 'z')
        db i and not ' '
    else
        db i
    endif
    i = i + 1

    endm

s db 'This IS a test function.',0

.code

UpperCase PROC string:QWORD

    lea     r8,table_up
    mov     rax,rcx
    dec     rcx

_loop:

    inc     rcx
    movzx   edx,byte ptr [rcx]

    mov     r9b,[r8+rdx]
    mov     [rcx],r9b
    test    edx,edx
    jnz     _loop
    ret

UpperCase ENDP

start PROC

    invoke  UpperCase,ADDR s
    invoke  StdOut,ADDR s
    invoke  ExitProcess,0

start ENDP

END
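
The idea in this listing, translating first and testing for the terminator afterwards, maps to a do/while loop in C. The sketch below is an illustration only; the name upper_in_place_one_branch is made up, and it relies on the assumption that the map sends 0 to 0, as table_up above does.

```c
/* In-place uppercase with a single branch per byte. The byte is
   translated and stored before the zero test, so the terminator is
   also written back; this is harmless as long as map[0] == 0. */
void upper_in_place_one_branch(unsigned char *s, const unsigned char map[256])
{
    unsigned char c;
    do {
        c = *s;           /* movzx edx, byte ptr [rcx] */
        *s = map[c];      /* mov r9b,[r8+rdx] / mov [rcx],r9b */
        s++;
    } while (c != 0);     /* test edx,edx / jnz _loop */
}
```

The cost is one redundant store for the terminating zero, traded against one conditional jump per character.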
Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 14, 2022, 06:29:43 AM
@all: thanks :thup:

I guess you all realise that szLower2 and szUpper2 look fast because they are in-place algos. That is, after one iteration, the remaining 99 iterations are performed on an already converted string. And that is way faster because it needs only one cmp, and jumps immediately.

I did a test to overcome this problem, as follows:
NameH equ szUpper2+szLower
TestH proc
  mov ebx, AlgoLoops/2-1 ; AlgoLoops/2 passes with two calls each, e.g. 50 passes x 2 calls = 100 conversions
  align 4
  .Repeat
	invoke szLower2, offset Src
	invoke szUpper2, offset Src
	dec ebx
  .Until Sign?
  ret
TestH endp
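
Why the alternating test is fairer can be shown by counting how many bytes an in-place routine really changes over 100 calls. A C sketch with invented names, not the testbed code:

```c
#include <stddef.h>

/* In-place conversions that report how many bytes they had to change. */
size_t upper_in_place(char *s)
{
    size_t changed = 0;
    for (; *s; s++) {
        if (*s >= 'a' && *s <= 'z') { *s = (char)(*s - 32); changed++; }
    }
    return changed;
}

size_t lower_in_place(char *s)
{
    size_t changed = 0;
    for (; *s; s++) {
        if (*s >= 'A' && *s <= 'Z') { *s = (char)(*s + 32); changed++; }
    }
    return changed;
}

/* Same conversion repeated: only the first call finds anything to do. */
size_t work_same_direction(char *s, int calls)
{
    size_t changed = 0;
    for (int i = 0; i < calls; i++) changed += upper_in_place(s);
    return changed;
}

/* Alternating conversions, as in TestH: every call has real work. */
size_t work_alternating(char *s, int calls)
{
    size_t changed = 0;
    for (int i = 0; i < calls / 2; i++) {
        changed += lower_in_place(s);
        changed += upper_in_place(s);
    }
    return changed;
}
```

For "Hello World" and 100 calls, the repeated uppercase conversion changes 8 bytes in total, all of them in the first call, while the alternating version changes 992. Only the second figure resembles what the "new dest" algos have to do on every pass.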


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 395/100 cycles

4988    kCycles for 100 * cmp al, new dest
1970    kCycles for 100 * table, new dest
3732    kCycles for 100 * cmp al, in place
2026    kCycles for 100 * xlat, new dest
1973    kCycles for 100 * table, source zero-delimited
1975    kCycles for 100 * table, source zero-delimited, stosb
2205    kCycles for 100 * szLower2
8608    kCycles for 100 * szUpper2+szLower

2669    kCycles for 100 * cmp al, new dest
1970    kCycles for 100 * table, new dest
3844    kCycles for 100 * cmp al, in place
2029    kCycles for 100 * xlat, new dest
1971    kCycles for 100 * table, source zero-delimited
1974    kCycles for 100 * table, source zero-delimited, stosb
2209    kCycles for 100 * szLower2
8605    kCycles for 100 * szUpper2+szLower
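
The in-place effect described above can be made visible with a small C sketch. This is an illustration under assumptions, not the testbed code: upper_in_place and lower_in_place are hypothetical stand-ins for szUpper2 and szLower2, and instead of cycles they count how many bytes each call actually rewrites. Repeating the same conversion only does real work on the first pass, while alternating lower and upper (as TestH does) rewrites the letters on every call.

```c
#include <stddef.h>

/* Hypothetical stand-in for szUpper2: convert in place,
   return the number of bytes that were actually changed. */
static size_t upper_in_place(char *s)
{
    size_t changed = 0;
    for (; *s; s++) {
        if (*s >= 'a' && *s <= 'z') {
            *s = (char)(*s - ('a' - 'A'));
            changed++;
        }
    }
    return changed;
}

/* Hypothetical stand-in for szLower2, same contract. */
static size_t lower_in_place(char *s)
{
    size_t changed = 0;
    for (; *s; s++) {
        if (*s >= 'A' && *s <= 'Z') {
            *s = (char)(*s + ('a' - 'A'));
            changed++;
        }
    }
    return changed;
}

/* Naive benchmark loop: the same conversion, `loops` times. */
static size_t run_same(char *s, int loops)
{
    size_t total = 0;
    for (int i = 0; i < loops; i++)
        total += upper_in_place(s);
    return total;
}

/* TestH-style loop: loops/2 rounds of lower followed by upper. */
static size_t run_alternating(char *s, int loops)
{
    size_t total = 0;
    for (int i = 0; i < loops / 2; i++) {
        total += lower_in_place(s);
        total += upper_in_place(s);
    }
    return total;
}
```

With a mixed-case source string, run_same reports rewritten bytes only for the first of its calls, whereas run_alternating reports them for every call. The byte count is only a proxy for work done; the cycle counts above remain the actual measurement.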
Title: Re: ToUpper & ToLower timings
Post by: guga on January 14, 2022, 04:49:58 PM
AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

3548    kCycles for 100 * cmp al, new dest
1512    kCycles for 100 * table, new dest
2841    kCycles for 100 * cmp al, in place
2469    kCycles for 100 * xlat, new dest
1507    kCycles for 100 * table, source zero-delimited
2846    kCycles for 100 * table, source zero-delimited, stosb
2360    kCycles for 100 * szLower2
1461    kCycles for 100 * szUpper2

2584    kCycles for 100 * cmp al, new dest
1503    kCycles for 100 * table, new dest
2795    kCycles for 100 * cmp al, in place
2459    kCycles for 100 * xlat, new dest
1543    kCycles for 100 * table, source zero-delimited
2903    kCycles for 100 * table, source zero-delimited, stosb
2195    kCycles for 100 * szLower2
1140    kCycles for 100 * szUpper2

2179    kCycles for 100 * cmp al, new dest
1464    kCycles for 100 * table, new dest
2761    kCycles for 100 * cmp al, in place
3423    kCycles for 100 * xlat, new dest
2219    kCycles for 100 * table, source zero-delimited
3214    kCycles for 100 * table, source zero-delimited, stosb
1955    kCycles for 100 * szLower2
1022    kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2


--- ok ---
Title: Re: ToUpper & ToLower timings
Post by: quarantined on January 16, 2022, 08:00:23 AM
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (SSE4)

8409    kCycles for 100 * cmp al, new dest
2725    kCycles for 100 * table, new dest
7996    kCycles for 100 * cmp al, in place
2952    kCycles for 100 * xlat, new dest
8216    kCycles for 100 * LevelUp, in place

8272    kCycles for 100 * cmp al, new dest
2679    kCycles for 100 * table, new dest
8048    kCycles for 100 * cmp al, in place
2961    kCycles for 100 * xlat, new dest
8161    kCycles for 100 * LevelUp, in place

8218    kCycles for 100 * cmp al, new dest
2691    kCycles for 100 * table, new dest
8011    kCycles for 100 * cmp al, in place
2957    kCycles for 100 * xlat, new dest
8178    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


--- ok ---

Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (SSE4)

8418    kCycles for 100 * cmp al, new dest
2556    kCycles for 100 * table, new dest
8051    kCycles for 100 * cmp al, in place
2949    kCycles for 100 * xlat, new dest
2958    kCycles for 100 * table, source zero-delimited
2968    kCycles for 100 * table, source zero-delimited, stosb
7249    kCycles for 100 * szLower
1976    kCycles for 100 * szUpper

5304    kCycles for 100 * cmp al, new dest
2352    kCycles for 100 * table, new dest
8046    kCycles for 100 * cmp al, in place
3097    kCycles for 100 * xlat, new dest
2939    kCycles for 100 * table, source zero-delimited
2949    kCycles for 100 * table, source zero-delimited, stosb
7098    kCycles for 100 * szLower
1975    kCycles for 100 * szUpper

5284    kCycles for 100 * cmp al, new dest
2316    kCycles for 100 * table, new dest
8296    kCycles for 100 * cmp al, in place
3099    kCycles for 100 * xlat, new dest
3115    kCycles for 100 * table, source zero-delimited
2970    kCycles for 100 * table, source zero-delimited, stosb
7079    kCycles for 100 * szLower
1969    kCycles for 100 * szUpper

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower
7       bytes for szUpper


--- ok ---