ToUpper & ToLower timings

jj2007 · January 13, 2022, 06:24:14 AM

Quote from: Vortex on January 13, 2022, 04:48:33 AM
Hi Jochen,

Thanks, I liked your code creating the lookup table. Better and more elegant than my handcrafted tables.

Hi Erol,

It's called "cheating", but it fits the purpose ;-)

jj2007 · January 13, 2022, 06:37:28 AM

Quote from: hutch-- on January 13, 2022, 04:55:10 AM
JJ,

If you have the time, clock these two out of the library.

I have a suspicion that szUpper profits from the fact that it converts "in place", so from run #2 onwards the string is already all uppercase. But why is szLower so slow then? Mysteries...

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 631/100 cycles

4066    kCycles for 100 * cmp al, new dest
1606    kCycles for 100 * table, new dest
3050    kCycles for 100 * cmp al, in place
1659    kCycles for 100 * xlat, new dest
1614    kCycles for 100 * table, source zero-delimited
1612    kCycles for 100 * table, source zero-delimited, stosb
6310    kCycles for 100 * szLower
1610    kCycles for 100 * szUpper

2198    kCycles for 100 * cmp al, new dest
1626    kCycles for 100 * table, new dest
3139    kCycles for 100 * cmp al, in place
1657    kCycles for 100 * xlat, new dest
1610    kCycles for 100 * table, source zero-delimited
1631    kCycles for 100 * table, source zero-delimited, stosb
6308    kCycles for 100 * szLower
1615    kCycles for 100 * szUpper

2192    kCycles for 100 * cmp al, new dest
1613    kCycles for 100 * table, new dest
3049    kCycles for 100 * cmp al, in place
1667    kCycles for 100 * xlat, new dest
1613    kCycles for 100 * table, source zero-delimited
1617    kCycles for 100 * table, source zero-delimited, stosb
6359    kCycles for 100 * szLower
1612    kCycles for 100 * szUpper

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower
7       bytes for szUpper

nidud · January 13, 2022, 07:23:05 AM

deleted

hutch-- · January 13, 2022, 09:33:43 AM

I have no idea why they differ so much.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

3954 kCycles for 100 * cmp al, new dest
1281 kCycles for 100 * table, new dest
1896 kCycles for 100 * cmp al, in place
1603 kCycles for 100 * xlat, new dest
1490 kCycles for 100 * table, source zero-delimited
1693 kCycles for 100 * table, source zero-delimited, stosb
6283 kCycles for 100 * szLower
1541 kCycles for 100 * szUpper

1989 kCycles for 100 * cmp al, new dest
1304 kCycles for 100 * table, new dest
1897 kCycles for 100 * cmp al, in place
1603 kCycles for 100 * xlat, new dest
1482 kCycles for 100 * table, source zero-delimited
1695 kCycles for 100 * table, source zero-delimited, stosb
6282 kCycles for 100 * szLower
1618 kCycles for 100 * szUpper

1990 kCycles for 100 * cmp al, new dest
1280 kCycles for 100 * table, new dest
1898 kCycles for 100 * cmp al, in place
1608 kCycles for 100 * xlat, new dest
1479 kCycles for 100 * table, source zero-delimited
1693 kCycles for 100 * table, source zero-delimited, stosb
6295 kCycles for 100 * szLower
1618 kCycles for 100 * szUpper

75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
7 bytes for szLower
7 bytes for szUpper

jj2007 · January 13, 2022, 11:03:17 AM

Quote from: hutch-- on January 13, 2022, 09:33:43 AM
I have no idea why they differ so much.

It's the "in place" thing:
szLower:
cmp BYTE PTR [eax], 0 ; after the first iteration, it's a lowerstring
je @F
cmp BYTE PTR [eax], "A" ; nope
jb @B
cmp BYTE PTR [eax], "Z" ; this branch will be taken
ja @B

szUpper:
cmp BYTE PTR [eax], 0 ; after the first iteration, it's an upperstring
je @F
cmp BYTE PTR [eax], "a" ; this branch will be taken
jb @B
cmp BYTE PTR [eax], "z"
ja @B

Now this is weird... a big improvement for szLower, partly because I changed the AZ order, but mostly because of a single instruction change:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

4076    kCycles for 100 * cmp al, new dest
1614    kCycles for 100 * table, new dest
3152    kCycles for 100 * cmp al, in place
1657    kCycles for 100 * xlat, new dest
1613    kCycles for 100 * table, source zero-delimited
1617    kCycles for 100 * table, source zero-delimited, stosb
1827    kCycles for 100 * szLower2
1635    kCycles for 100 * szUpper2

2189    kCycles for 100 * cmp al, new dest
1613    kCycles for 100 * table, new dest
3143    kCycles for 100 * cmp al, in place
1688    kCycles for 100 * xlat, new dest
1623    kCycles for 100 * table, source zero-delimited
1622    kCycles for 100 * table, source zero-delimited, stosb
1806    kCycles for 100 * szLower2
1616    kCycles for 100 * szUpper2

Big improvement for szLower, right?

Code Select

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLower2 proc text:DWORD

  ; -----------------------------
  ; converts string to lower case
  ; invoke szLower,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax
; align 2	; much, much slower on my i5
  @@:
  if 1
    inc eax	; much, much faster than add eax, 1
  else
    add eax, 1
  endif 
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "Z"
    ja @B
    cmp BYTE PTR [eax], "A"
    jb @B
    or BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szLower2 endp

Btw the source assembles without MasmBasic, it's plain Masm32 SDK
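
The timings list szUpper2, but its source is not posted in this thread. Below is a minimal sketch of what it might look like, assuming it simply mirrors szLower2: test against "z" first, then "a", and clear bit 5 instead of setting it. The test string and the `start` stub are hypothetical, and the real szUpper2 may differ.

```asm
include \masm32\include\masm32rt.inc

.data
TestStr db "Mixed Case: aZ, Az, [brackets] and {braces}", 0   ; hypothetical input

.code

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szUpper2 proc text:DWORD    ; ASSUMED mirror of szLower2, not the posted source

    mov eax, [esp+4]            ; eax -> string
    dec eax                     ; the loop starts with inc eax
  @@:
    inc eax
    cmp BYTE PTR [eax], 0
    je @F                       ; zero terminator: done
    cmp BYTE PTR [eax], "z"
    ja @B                       ; above "z": not a lowercase letter
    cmp BYTE PTR [eax], "a"
    jb @B                       ; below "a": not a lowercase letter
    and BYTE PTR [eax], 0DFh    ; clear bit 5: "a".."z" becomes "A".."Z"
    jmp @B
  @@:
    mov eax, [esp+4]            ; return the string address
    ret 4

szUpper2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

start:
    invoke szUpper2, addr TestStr
    invoke StdOut, addr TestStr     ; show the converted string
    invoke ExitProcess, 0

end start
```

With this ordering, an already uppercase string fails the "a" test (all its letters are below "a"), so both compares run for every letter, just as szLower2 needs both compares on an already lowercase string.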

LiaoMi · January 14, 2022, 05:13:03 AM

Quote from: jj2007 on January 13, 2022, 11:03:17 AM
Quote from: hutch-- on January 13, 2022, 09:33:43 AM
I have no idea why they differ so much.

It's the "in place" thing:
szLower:
cmp BYTE PTR [eax], 0 ; after the first iteration, it's a lowerstring
je @F
cmp BYTE PTR [eax], "A" ; nope
jb @B
cmp BYTE PTR [eax], "Z" ; this branch will be taken
ja @B

szUpper:
cmp BYTE PTR [eax], 0 ; after the first iteration, it's an upperstring
je @F
cmp BYTE PTR [eax], "a" ; this branch will be taken
jb @B
cmp BYTE PTR [eax], "z"
ja @B

Now this is weird... a big improvement for szLower, partly because I changed the AZ order, but mostly because of a single instruction change:

Big improvement for szLower, right?
Code Select

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLower2 proc text:DWORD

  ; -----------------------------
  ; converts string to lower case
  ; invoke szLower,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax
; align 2	; much, much slower on my i5
  @@:
  if 1
    inc eax	; much, much faster than add eax, 1
  else
    add eax, 1
  endif
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "Z"
    ja @B
    cmp BYTE PTR [eax], "A"
    jb @B
    or BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szLower2 endp

Btw the source assembles without MasmBasic, it's plain Masm32 SDK

Code Select

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

1441    kCycles for 100 * cmp al, new dest
705     kCycles for 100 * table, new dest
1147    kCycles for 100 * cmp al, in place
981     kCycles for 100 * xlat, new dest
887     kCycles for 100 * table, source zero-delimited
914     kCycles for 100 * table, source zero-delimited, stosb
1088    kCycles for 100 * szLower2
952     kCycles for 100 * szUpper2

938     kCycles for 100 * cmp al, new dest
706     kCycles for 100 * table, new dest
1142    kCycles for 100 * cmp al, in place
974     kCycles for 100 * xlat, new dest
963     kCycles for 100 * table, source zero-delimited
914     kCycles for 100 * table, source zero-delimited, stosb
1107    kCycles for 100 * szLower2
974     kCycles for 100 * szUpper2

1013    kCycles for 100 * cmp al, new dest
702     kCycles for 100 * table, new dest
1162    kCycles for 100 * cmp al, in place
978     kCycles for 100 * xlat, new dest
893     kCycles for 100 * table, source zero-delimited
956     kCycles for 100 * table, source zero-delimited, stosb
1124    kCycles for 100 * szLower2
953     kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2


--- ok ---

LiaoMi · January 14, 2022, 05:14:40 AM

Quote from: nidud on January 13, 2022, 07:23:05 AM
They will probably be similar with a small advantage for the table.

Code Select

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (AVX512)
----------------------------------------------
-- test(1)
    29771 cycles, rep(1000), code( 29) 0.asm: cmp
    38149 cycles, rep(1000), code(288) 1.asm: table
    28763 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(2)
    29992 cycles, rep(1000), code( 29) 0.asm: cmp
    40331 cycles, rep(1000), code(288) 1.asm: table
    25967 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(3)
    29737 cycles, rep(1000), code( 29) 0.asm: cmp
    40168 cycles, rep(1000), code(288) 1.asm: table
    29103 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(4)
    30073 cycles, rep(1000), code( 29) 0.asm: cmp
    41148 cycles, rep(1000), code(288) 1.asm: table
    28872 cycles, rep(1000), code(304) 2.asm: cmp+table

total [1 .. 4], 1++
   112705 cycles 2.asm: cmp+table
   119573 cycles 0.asm: cmp
   159796 cycles 1.asm: table
hit any key to continue...

TimoVJL · January 14, 2022, 05:20:11 AM

Code Select

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

3243    kCycles for 100 * cmp al, new dest
1348    kCycles for 100 * table, new dest
2541    kCycles for 100 * cmp al, in place
2202    kCycles for 100 * xlat, new dest
1419    kCycles for 100 * table, source zero-delimited
2641    kCycles for 100 * table, source zero-delimited, stosb
1782    kCycles for 100 * szLower2
1015    kCycles for 100 * szUpper2

1963    kCycles for 100 * cmp al, new dest
1362    kCycles for 100 * table, new dest
2541    kCycles for 100 * cmp al, in place
2201    kCycles for 100 * xlat, new dest
1352    kCycles for 100 * table, source zero-delimited
2632    kCycles for 100 * table, source zero-delimited, stosb
1782    kCycles for 100 * szLower2
981     kCycles for 100 * szUpper2

1983    kCycles for 100 * cmp al, new dest
1344    kCycles for 100 * table, new dest
2517    kCycles for 100 * cmp al, in place
2196    kCycles for 100 * xlat, new dest
1407    kCycles for 100 * table, source zero-delimited
2636    kCycles for 100 * table, source zero-delimited, stosb
1745    kCycles for 100 * szLower2
989     kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2

Code Select

AMD Ryzen 5 3400G with Radeon Vega Graphics     (AVX2)
----------------------------------------------
-- test(1)
    77637 cycles, rep(1000), code( 29) 0.asm: cmp
    58463 cycles, rep(1000), code(288) 1.asm: table
    85323 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(2)
    90772 cycles, rep(1000), code( 29) 0.asm: cmp
    68947 cycles, rep(1000), code(288) 1.asm: table
    99731 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(3)
    95322 cycles, rep(1000), code( 29) 0.asm: cmp
    67971 cycles, rep(1000), code(288) 1.asm: table
    95539 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(4)
    98111 cycles, rep(1000), code( 29) 0.asm: cmp
    73314 cycles, rep(1000), code(288) 1.asm: table
    92986 cycles, rep(1000), code(304) 2.asm: cmp+table

total [1 .. 4], 1++
   268695 cycles 1.asm: table
   361842 cycles 0.asm: cmp
   373579 cycles 2.asm: cmp+table

Vortex · January 14, 2022, 06:03:05 AM

Hi Nidud,

Your lookup table function can be made faster by eliminating the first jump after test edx,edx:

Code Select

include \masm32\include64\masm64rt.inc

    .data

table_up label sbyte

i = 0

while i lt 256

    if (i ge 'a') and (i le 'z')
        db i and not ' '
    else
        db i
    endif
    i = i + 1

    endm

s db 'This IS a test function.',0

.code

UpperCase PROC string:QWORD

    lea     r8,table_up
    mov     rax,rcx
    dec     rcx

_loop:

    inc     rcx
    movzx   edx,byte ptr [rcx]

    mov     r9b,[r8+rdx]
    mov     [rcx],r9b
    test    edx,edx
    jnz     _loop
    ret

UpperCase ENDP

start PROC

    invoke  UpperCase,ADDR s
    invoke  StdOut,ADDR s
    invoke  ExitProcess,0

start ENDP

END

jj2007 · January 14, 2022, 06:29:43 AM

@all: thanks

I guess you all realise that szLower2 and szUpper2 look fast because they are in-place algos. That is, after one iteration, the remaining 99 iterations are performed on an already converted string. And that is way faster because it needs only one cmp, and jumps immediately.

I did a test to overcome this problem, as follows:

Code Select

NameH equ szUpper2+szLower
TestH proc
  mov ebx, AlgoLoops/2-1	; AlgoLoops/2 passes of two calls each, e.g. 50 passes for AlgoLoops=100
  align 4
  .Repeat
	invoke szLower2, offset Src
	invoke szUpper2, offset Src
	dec ebx
  .Until Sign?
  ret
TestH endp

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 395/100 cycles

4988    kCycles for 100 * cmp al, new dest
1970    kCycles for 100 * table, new dest
3732    kCycles for 100 * cmp al, in place
2026    kCycles for 100 * xlat, new dest
1973    kCycles for 100 * table, source zero-delimited
1975    kCycles for 100 * table, source zero-delimited, stosb
2205    kCycles for 100 * szLower2
8608    kCycles for 100 * szUpper2+szLower

2669    kCycles for 100 * cmp al, new dest
1970    kCycles for 100 * table, new dest
3844    kCycles for 100 * cmp al, in place
2029    kCycles for 100 * xlat, new dest
1971    kCycles for 100 * table, source zero-delimited
1974    kCycles for 100 * table, source zero-delimited, stosb
2209    kCycles for 100 * szLower2
8605    kCycles for 100 * szUpper2+szLower
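
Another way to stop an in-place routine from running on already converted text is to restore the buffer from an untouched copy before every call. The following is only a sketch of that idea, not the harness used for the timings above: `Pristine`, `Work`, `NextPass` and `CopyLoop` are hypothetical names, the pass count of 100 is assumed, and szLower2 is repeated from the earlier post so that the example assembles on its own with the Masm32 SDK.

```asm
include \masm32\include\masm32rt.inc

.data
Pristine db "This IS a Mixed-Case TEST String.", 0   ; hypothetical input, never modified
Work     db 64 dup(0)                                ; scratch copy, converted in place

.code

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLower2 proc text:DWORD    ; same routine as posted earlier in the thread
    mov eax, [esp+4]
    dec eax
  @@:
    inc eax
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "Z"
    ja @B
    cmp BYTE PTR [eax], "A"
    jb @B
    or BYTE PTR [eax], 32
    jmp @B
  @@:
    mov eax, [esp+4]
    ret 4
szLower2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

start:
    mov ebx, 100-1              ; 100 passes, counted down until the sign flag is set
  NextPass:
    mov esi, offset Pristine    ; restore the mixed-case text first ...
    mov edi, offset Work
  CopyLoop:
    lodsb
    stosb
    test al, al                 ; copy up to and including the zero terminator
    jnz CopyLoop
    invoke szLower2, offset Work    ; ... so every pass converts fresh input
    dec ebx
    jns NextPass
    invoke StdOut, offset Work      ; show the final converted string
    invoke ExitProcess, 0

end start
```

The copy costs cycles too, so in a real timing run it would have to be measured separately and subtracted, or added to all candidates alike.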

guga · January 14, 2022, 04:49:58 PM

Code Select

AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

3548    kCycles for 100 * cmp al, new dest
1512    kCycles for 100 * table, new dest
2841    kCycles for 100 * cmp al, in place
2469    kCycles for 100 * xlat, new dest
1507    kCycles for 100 * table, source zero-delimited
2846    kCycles for 100 * table, source zero-delimited, stosb
2360    kCycles for 100 * szLower2
1461    kCycles for 100 * szUpper2

2584    kCycles for 100 * cmp al, new dest
1503    kCycles for 100 * table, new dest
2795    kCycles for 100 * cmp al, in place
2459    kCycles for 100 * xlat, new dest
1543    kCycles for 100 * table, source zero-delimited
2903    kCycles for 100 * table, source zero-delimited, stosb
2195    kCycles for 100 * szLower2
1140    kCycles for 100 * szUpper2

2179    kCycles for 100 * cmp al, new dest
1464    kCycles for 100 * table, new dest
2761    kCycles for 100 * cmp al, in place
3423    kCycles for 100 * xlat, new dest
2219    kCycles for 100 * table, source zero-delimited
3214    kCycles for 100 * table, source zero-delimited, stosb
1955    kCycles for 100 * szLower2
1022    kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2


--- ok ---

quarantined · January 16, 2022, 08:00:23 AM

Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)

8409 kCycles for 100 * cmp al, new dest
2725 kCycles for 100 * table, new dest
7996 kCycles for 100 * cmp al, in place
2952 kCycles for 100 * xlat, new dest
8216 kCycles for 100 * LevelUp, in place

8272 kCycles for 100 * cmp al, new dest
2679 kCycles for 100 * table, new dest
8048 kCycles for 100 * cmp al, in place
2961 kCycles for 100 * xlat, new dest
8161 kCycles for 100 * LevelUp, in place

8218 kCycles for 100 * cmp al, new dest
2691 kCycles for 100 * table, new dest
8011 kCycles for 100 * cmp al, in place
2957 kCycles for 100 * xlat, new dest
8178 kCycles for 100 * LevelUp, in place

75 bytes for cmp al, new dest
65 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
66 bytes for LevelUp, in place

--- ok ---

Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)

8418 kCycles for 100 * cmp al, new dest
2556 kCycles for 100 * table, new dest
8051 kCycles for 100 * cmp al, in place
2949 kCycles for 100 * xlat, new dest
2958 kCycles for 100 * table, source zero-delimited
2968 kCycles for 100 * table, source zero-delimited, stosb
7249 kCycles for 100 * szLower
1976 kCycles for 100 * szUpper

5304 kCycles for 100 * cmp al, new dest
2352 kCycles for 100 * table, new dest
8046 kCycles for 100 * cmp al, in place
3097 kCycles for 100 * xlat, new dest
2939 kCycles for 100 * table, source zero-delimited
2949 kCycles for 100 * table, source zero-delimited, stosb
7098 kCycles for 100 * szLower
1975 kCycles for 100 * szUpper

5284 kCycles for 100 * cmp al, new dest
2316 kCycles for 100 * table, new dest
8296 kCycles for 100 * cmp al, in place
3099 kCycles for 100 * xlat, new dest
3115 kCycles for 100 * table, source zero-delimited
2970 kCycles for 100 * table, source zero-delimited, stosb
7079 kCycles for 100 * szLower
1969 kCycles for 100 * szUpper

75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
7 bytes for szLower
7 bytes for szUpper

--- ok ---

The MASM Forum
