ToUpper & ToLower timings

Started by jj2007, January 11, 2022, 01:06:27 PM

jj2007

Quote from: Vortex on January 13, 2022, 04:48:33 AM
Hi Jochen,

Thanks, I liked your code creating the lookup table :thumbsup: Better and more elegant than my handcrafted tables.

Hi Erol,

It's called "cheating", but it fits the purpose ;-)

jj2007

Quote from: hutch-- on January 13, 2022, 04:55:10 AM
JJ,

If you have the time, clock these two out of the library.

I suspect that szUpper profits from converting "in place": from run #2 onward, the string is already all uppercase. But why is szLower so slow then? Mysteries :sad:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 631/100 cycles

4066    kCycles for 100 * cmp al, new dest
1606    kCycles for 100 * table, new dest
3050    kCycles for 100 * cmp al, in place
1659    kCycles for 100 * xlat, new dest
1614    kCycles for 100 * table, source zero-delimited
1612    kCycles for 100 * table, source zero-delimited, stosb
6310    kCycles for 100 * szLower
1610    kCycles for 100 * szUpper

2198    kCycles for 100 * cmp al, new dest
1626    kCycles for 100 * table, new dest
3139    kCycles for 100 * cmp al, in place
1657    kCycles for 100 * xlat, new dest
1610    kCycles for 100 * table, source zero-delimited
1631    kCycles for 100 * table, source zero-delimited, stosb
6308    kCycles for 100 * szLower
1615    kCycles for 100 * szUpper

2192    kCycles for 100 * cmp al, new dest
1613    kCycles for 100 * table, new dest
3049    kCycles for 100 * cmp al, in place
1667    kCycles for 100 * xlat, new dest
1613    kCycles for 100 * table, source zero-delimited
1617    kCycles for 100 * table, source zero-delimited, stosb
6359    kCycles for 100 * szLower
1612    kCycles for 100 * szUpper

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower
7       bytes for szUpper

hutch--

I have no idea why they differ so much.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

3954    kCycles for 100 * cmp al, new dest
1281    kCycles for 100 * table, new dest
1896    kCycles for 100 * cmp al, in place
1603    kCycles for 100 * xlat, new dest
1490    kCycles for 100 * table, source zero-delimited
1693    kCycles for 100 * table, source zero-delimited, stosb
6283    kCycles for 100 * szLower
1541    kCycles for 100 * szUpper

1989    kCycles for 100 * cmp al, new dest
1304    kCycles for 100 * table, new dest
1897    kCycles for 100 * cmp al, in place
1603    kCycles for 100 * xlat, new dest
1482    kCycles for 100 * table, source zero-delimited
1695    kCycles for 100 * table, source zero-delimited, stosb
6282    kCycles for 100 * szLower
1618    kCycles for 100 * szUpper

1990    kCycles for 100 * cmp al, new dest
1280    kCycles for 100 * table, new dest
1898    kCycles for 100 * cmp al, in place
1608    kCycles for 100 * xlat, new dest
1479    kCycles for 100 * table, source zero-delimited
1693    kCycles for 100 * table, source zero-delimited, stosb
6295    kCycles for 100 * szLower
1618    kCycles for 100 * szUpper

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower
7       bytes for szUpper

jj2007

Quote from: hutch-- on January 13, 2022, 09:33:43 AM
I have no idea why they differ so much.

It's the "in place" thing:
szLower:
    cmp BYTE PTR [eax], 0   ; after the first iteration, the string is all lowercase
    je @F
    cmp BYTE PTR [eax], "A"
    jb @B                   ; not taken for lowercase letters
    cmp BYTE PTR [eax], "Z"
    ja @B                   ; taken, but only after a second compare

szUpper:
    cmp BYTE PTR [eax], 0   ; after the first iteration, the string is all uppercase
    je @F
    cmp BYTE PTR [eax], "a"
    jb @B                   ; taken right away for uppercase letters
    cmp BYTE PTR [eax], "z"
    ja @B
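
To make the asymmetry concrete, here is a rough C model (not the library code) that counts how many of the two range compares run per byte when the lower bound is tested first, as in the two listings above. The name count_range_compares and the sample strings are made up for illustration, and compare counts alone do not predict cycle counts.

```c
#include <stddef.h>

/* Hypothetical helper: counts the range compares executed for a string when
   the lower bound is tested first and the upper bound only if needed.
   The compare against the zero terminator is not counted. */
static size_t count_range_compares(const char *s, unsigned char lo, unsigned char hi)
{
    size_t compares = 0;
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        ++compares;          /* cmp c, lo, then jb: skip if below */
        if (c < lo)
            continue;
        ++compares;          /* cmp c, hi, then ja: skip if above */
        if (c > hi)
            continue;
        /* c is inside [lo, hi]: the case bit would be changed here */
    }
    return compares;
}
```

For an already lowercased string, count_range_compares("hello world", 'A', 'Z') runs both compares on every letter, while for an already uppercased string count_range_compares("HELLO WORLD", 'a', 'z') stops after the first compare on every byte.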

Now this is weird... a big improvement for szLower, partly because I swapped the order of the "A" and "Z" tests, but mostly because of a single instruction change:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

4076    kCycles for 100 * cmp al, new dest
1614    kCycles for 100 * table, new dest
3152    kCycles for 100 * cmp al, in place
1657    kCycles for 100 * xlat, new dest
1613    kCycles for 100 * table, source zero-delimited
1617    kCycles for 100 * table, source zero-delimited, stosb
1827    kCycles for 100 * szLower2
1635    kCycles for 100 * szUpper2

2189    kCycles for 100 * cmp al, new dest
1613    kCycles for 100 * table, new dest
3143    kCycles for 100 * cmp al, in place
1688    kCycles for 100 * xlat, new dest
1623    kCycles for 100 * table, source zero-delimited
1622    kCycles for 100 * table, source zero-delimited, stosb
1806    kCycles for 100 * szLower2
1616    kCycles for 100 * szUpper2


Big improvement for szLower, right?
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLower2 proc text:DWORD

  ; -----------------------------
  ; converts string to lower case
  ; invoke szLower2, ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax
; align 2 ; much, much slower on my i5
  @@:
  if 1
    inc eax ; much, much faster than add eax, 1
  else
    add eax, 1
  endif
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "Z"
    ja @B
    cmp BYTE PTR [eax], "A"
    jb @B
    or BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szLower2 endp


Btw the source assembles without MasmBasic, it's plain Masm32 SDK :cool:
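
For readers who prefer to see the logic outside of assembly, here is a minimal C sketch of what szLower2 does. The function name sz_lower2_model is an assumption for illustration only. It mirrors the control flow (upper bound tested first, OR with 32, original pointer returned) and says nothing about the inc/add timing difference.

```c
/* C model of szLower2: walks the string in place, tests the upper bound
   'Z' first, and sets bit 5 (OR 32) for bytes in 'A'..'Z'.
   Returns the original pointer, as the asm version does in eax. */
static char *sz_lower2_model(char *text)
{
    unsigned char *p = (unsigned char *)text;
    for (; *p != 0; ++p) {
        if (*p > 'Z')    /* cmp BYTE PTR [eax], "Z" / ja @B */
            continue;
        if (*p < 'A')    /* cmp BYTE PTR [eax], "A" / jb @B */
            continue;
        *p |= 32;        /* or BYTE PTR [eax], 32 */
    }
    return text;
}
```

The compares are unsigned, like ja and jb, so bytes above 127 fall into the "above Z" case and are left alone.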

LiaoMi

Quote from: jj2007 on January 13, 2022, 11:03:17 AM

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

1441    kCycles for 100 * cmp al, new dest
705     kCycles for 100 * table, new dest
1147    kCycles for 100 * cmp al, in place
981     kCycles for 100 * xlat, new dest
887     kCycles for 100 * table, source zero-delimited
914     kCycles for 100 * table, source zero-delimited, stosb
1088    kCycles for 100 * szLower2
952     kCycles for 100 * szUpper2

938     kCycles for 100 * cmp al, new dest
706     kCycles for 100 * table, new dest
1142    kCycles for 100 * cmp al, in place
974     kCycles for 100 * xlat, new dest
963     kCycles for 100 * table, source zero-delimited
914     kCycles for 100 * table, source zero-delimited, stosb
1107    kCycles for 100 * szLower2
974     kCycles for 100 * szUpper2

1013    kCycles for 100 * cmp al, new dest
702     kCycles for 100 * table, new dest
1162    kCycles for 100 * cmp al, in place
978     kCycles for 100 * xlat, new dest
893     kCycles for 100 * table, source zero-delimited
956     kCycles for 100 * table, source zero-delimited, stosb
1124    kCycles for 100 * szLower2
953     kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2


--- ok ---

LiaoMi

Quote from: nidud on January 13, 2022, 07:23:05 AM
They will probably be similar with a small advantage for the table.

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (AVX512)
----------------------------------------------
-- test(1)
    29771 cycles, rep(1000), code( 29) 0.asm: cmp
    38149 cycles, rep(1000), code(288) 1.asm: table
    28763 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(2)
    29992 cycles, rep(1000), code( 29) 0.asm: cmp
    40331 cycles, rep(1000), code(288) 1.asm: table
    25967 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(3)
    29737 cycles, rep(1000), code( 29) 0.asm: cmp
    40168 cycles, rep(1000), code(288) 1.asm: table
    29103 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(4)
    30073 cycles, rep(1000), code( 29) 0.asm: cmp
    41148 cycles, rep(1000), code(288) 1.asm: table
    28872 cycles, rep(1000), code(304) 2.asm: cmp+table

total [1 .. 4], 1++
   112705 cycles 2.asm: cmp+table
   119573 cycles 0.asm: cmp
   159796 cycles 1.asm: table
hit any key to continue...

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

3243    kCycles for 100 * cmp al, new dest
1348    kCycles for 100 * table, new dest
2541    kCycles for 100 * cmp al, in place
2202    kCycles for 100 * xlat, new dest
1419    kCycles for 100 * table, source zero-delimited
2641    kCycles for 100 * table, source zero-delimited, stosb
1782    kCycles for 100 * szLower2
1015    kCycles for 100 * szUpper2

1963    kCycles for 100 * cmp al, new dest
1362    kCycles for 100 * table, new dest
2541    kCycles for 100 * cmp al, in place
2201    kCycles for 100 * xlat, new dest
1352    kCycles for 100 * table, source zero-delimited
2632    kCycles for 100 * table, source zero-delimited, stosb
1782    kCycles for 100 * szLower2
981     kCycles for 100 * szUpper2

1983    kCycles for 100 * cmp al, new dest
1344    kCycles for 100 * table, new dest
2517    kCycles for 100 * cmp al, in place
2196    kCycles for 100 * xlat, new dest
1407    kCycles for 100 * table, source zero-delimited
2636    kCycles for 100 * table, source zero-delimited, stosb
1745    kCycles for 100 * szLower2
989     kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2
AMD Ryzen 5 3400G with Radeon Vega Graphics     (AVX2)
----------------------------------------------
-- test(1)
    77637 cycles, rep(1000), code( 29) 0.asm: cmp
    58463 cycles, rep(1000), code(288) 1.asm: table
    85323 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(2)
    90772 cycles, rep(1000), code( 29) 0.asm: cmp
    68947 cycles, rep(1000), code(288) 1.asm: table
    99731 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(3)
    95322 cycles, rep(1000), code( 29) 0.asm: cmp
    67971 cycles, rep(1000), code(288) 1.asm: table
    95539 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(4)
    98111 cycles, rep(1000), code( 29) 0.asm: cmp
    73314 cycles, rep(1000), code(288) 1.asm: table
    92986 cycles, rep(1000), code(304) 2.asm: cmp+table

total [1 .. 4], 1++
   268695 cycles 1.asm: table
   361842 cycles 0.asm: cmp
   373579 cycles 2.asm: cmp+table

Vortex

Hi Nidud,

Your lookup table function can be made faster by eliminating the first jump that comes after test edx,edx:

include \masm32\include64\masm64rt.inc

    .data

table_up label sbyte

i = 0

while i lt 256

    if (i ge 'a') and (i le 'z')
        db i and not ' '
    else
        db i
    endif
    i = i + 1

    endm

s db 'This IS a test function.',0

.code

UpperCase PROC string:QWORD

    lea     r8,table_up
    mov     rax,rcx
    dec     rcx

_loop:

    inc     rcx
    movzx   edx,byte ptr [rcx]

    mov     r9b,[r8+rdx]
    mov     [rcx],r9b
    test    edx,edx
    jnz     _loop
    ret

UpperCase ENDP

start PROC

    invoke  UpperCase,ADDR s
    invoke  StdOut,ADDR s
    invoke  ExitProcess,0

start ENDP

END
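
The same idea in C, as a hedged sketch rather than a translation of the listing above: the table is filled at run time here (the MASM version builds it at assembly time with a while loop), and the names build_table_up and upper_case_model are invented for the example. As in the asm loop, every byte is stored back before the terminator test, so only one conditional jump is needed.

```c
/* 256-entry table: identity except 'a'..'z', which map to 'A'..'Z'. */
static unsigned char table_up[256];

static void build_table_up(void)
{
    for (int i = 0; i < 256; ++i)
        table_up[i] = (i >= 'a' && i <= 'z') ? (unsigned char)(i & ~' ')
                                             : (unsigned char)i;
}

/* Translates every byte through the table, terminator included
   (table_up[0] is 0, so storing it back is harmless), then tests it. */
static char *upper_case_model(char *string)
{
    unsigned char *p = (unsigned char *)string;
    unsigned char c;
    do {
        c = *p;
        *p++ = table_up[c];
    } while (c != 0);
    return string;
}
```

build_table_up() has to run once before the first call to upper_case_model(). Storing the terminator back costs one extra write per string but removes a branch from the loop.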

jj2007

@all: thanks :thup:

I guess you all realise that szLower2 and szUpper2 look fast because they are in-place algos: after the first iteration, the remaining 99 iterations run on an already converted string. That is way faster because each letter then needs only one cmp against the range before the loop jumps back.

I did a test to overcome this problem, as follows:
NameH equ szUpper2+szLower
TestH proc
  mov ebx, AlgoLoops/2-1 ; loop e.g. 100x
  align 4
  .Repeat
    invoke szLower2, offset Src
    invoke szUpper2, offset Src
    dec ebx
  .Until Sign?
  ret
TestH endp
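
Why alternating the two calls matters can be shown with a small C sketch that counts how many bytes each call actually modifies. All names here are hypothetical and the sketch measures work done, not cycles.

```c
#include <stddef.h>

/* Lowercases ASCII 'A'..'Z' in place; returns how many bytes were modified. */
static size_t lower_in_place(char *s)
{
    size_t changed = 0;
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        if (c >= 'A' && c <= 'Z') { *s = (char)(c | 32); ++changed; }
    }
    return changed;
}

/* Uppercases ASCII 'a'..'z' in place; returns how many bytes were modified. */
static size_t upper_in_place(char *s)
{
    size_t changed = 0;
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        if (c >= 'a' && c <= 'z') { *s = (char)(c & ~32); ++changed; }
    }
    return changed;
}

/* Same conversion repeated: only the first call finds anything to convert. */
static size_t work_repeated(char *s, int calls)
{
    size_t total = 0;
    for (int i = 0; i < calls; ++i)
        total += lower_in_place(s);
    return total;
}

/* Lower then upper, as in TestH: every call after the first converts all letters. */
static size_t work_alternating(char *s, int pairs)
{
    size_t total = 0;
    for (int i = 0; i < pairs; ++i) {
        total += lower_in_place(s);
        total += upper_in_place(s);
    }
    return total;
}
```

With the buffer "Hello World", 100 repeated calls modify 2 bytes in total, while 50 lower/upper pairs modify 992, so the alternating loop keeps both routines on their slow path.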


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 395/100 cycles

4988    kCycles for 100 * cmp al, new dest
1970    kCycles for 100 * table, new dest
3732    kCycles for 100 * cmp al, in place
2026    kCycles for 100 * xlat, new dest
1973    kCycles for 100 * table, source zero-delimited
1975    kCycles for 100 * table, source zero-delimited, stosb
2205    kCycles for 100 * szLower2
8608    kCycles for 100 * szUpper2+szLower

2669    kCycles for 100 * cmp al, new dest
1970    kCycles for 100 * table, new dest
3844    kCycles for 100 * cmp al, in place
2029    kCycles for 100 * xlat, new dest
1971    kCycles for 100 * table, source zero-delimited
1974    kCycles for 100 * table, source zero-delimited, stosb
2209    kCycles for 100 * szLower2
8605    kCycles for 100 * szUpper2+szLower

guga

AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

3548    kCycles for 100 * cmp al, new dest
1512    kCycles for 100 * table, new dest
2841    kCycles for 100 * cmp al, in place
2469    kCycles for 100 * xlat, new dest
1507    kCycles for 100 * table, source zero-delimited
2846    kCycles for 100 * table, source zero-delimited, stosb
2360    kCycles for 100 * szLower2
1461    kCycles for 100 * szUpper2

2584    kCycles for 100 * cmp al, new dest
1503    kCycles for 100 * table, new dest
2795    kCycles for 100 * cmp al, in place
2459    kCycles for 100 * xlat, new dest
1543    kCycles for 100 * table, source zero-delimited
2903    kCycles for 100 * table, source zero-delimited, stosb
2195    kCycles for 100 * szLower2
1140    kCycles for 100 * szUpper2

2179    kCycles for 100 * cmp al, new dest
1464    kCycles for 100 * table, new dest
2761    kCycles for 100 * cmp al, in place
3423    kCycles for 100 * xlat, new dest
2219    kCycles for 100 * table, source zero-delimited
3214    kCycles for 100 * table, source zero-delimited, stosb
1955    kCycles for 100 * szLower2
1022    kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2


--- ok ---

quarantined

Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (SSE4)

8409    kCycles for 100 * cmp al, new dest
2725    kCycles for 100 * table, new dest
7996    kCycles for 100 * cmp al, in place
2952    kCycles for 100 * xlat, new dest
8216    kCycles for 100 * LevelUp, in place

8272    kCycles for 100 * cmp al, new dest
2679    kCycles for 100 * table, new dest
8048    kCycles for 100 * cmp al, in place
2961    kCycles for 100 * xlat, new dest
8161    kCycles for 100 * LevelUp, in place

8218    kCycles for 100 * cmp al, new dest
2691    kCycles for 100 * table, new dest
8011    kCycles for 100 * cmp al, in place
2957    kCycles for 100 * xlat, new dest
8178    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


--- ok ---

Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (SSE4)

8418    kCycles for 100 * cmp al, new dest
2556    kCycles for 100 * table, new dest
8051    kCycles for 100 * cmp al, in place
2949    kCycles for 100 * xlat, new dest
2958    kCycles for 100 * table, source zero-delimited
2968    kCycles for 100 * table, source zero-delimited, stosb
7249    kCycles for 100 * szLower
1976    kCycles for 100 * szUpper

5304    kCycles for 100 * cmp al, new dest
2352    kCycles for 100 * table, new dest
8046    kCycles for 100 * cmp al, in place
3097    kCycles for 100 * xlat, new dest
2939    kCycles for 100 * table, source zero-delimited
2949    kCycles for 100 * table, source zero-delimited, stosb
7098    kCycles for 100 * szLower
1975    kCycles for 100 * szUpper

5284    kCycles for 100 * cmp al, new dest
2316    kCycles for 100 * table, new dest
8296    kCycles for 100 * cmp al, in place
3099    kCycles for 100 * xlat, new dest
3115    kCycles for 100 * table, source zero-delimited
2970    kCycles for 100 * table, source zero-delimited, stosb
7079    kCycles for 100 * szLower
1969    kCycles for 100 * szUpper

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower
7       bytes for szUpper


--- ok ---