This collection is almost 8 years old, but apparently it has never been tested in the Lab :cool:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+++17 of 20 tests valid, loop overhead is approx. 644/100 cycles
4074 kCycles for 100 * cmp al, new dest
1623 kCycles for 100 * table, new dest
3063 kCycles for 100 * cmp al, in place
1669 kCycles for 100 * xlat, new dest
5571 kCycles for 100 * LevelUp, in place
4528 kCycles for 100 * cmp al, new dest
1618 kCycles for 100 * table, new dest
3631 kCycles for 100 * cmp al, in place
2802 kCycles for 100 * xlat, new dest
5553 kCycles for 100 * LevelUp, in place
4569 kCycles for 100 * cmp al, new dest
1633 kCycles for 100 * table, new dest
3076 kCycles for 100 * cmp al, in place
1669 kCycles for 100 * xlat, new dest
5569 kCycles for 100 * LevelUp, in place
75 bytes for cmp al, new dest
65 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
66 bytes for LevelUp, in place
See here for similar stuff at the FreeBasic forum (https://www.freebasic.net/forum/viewtopic.php?f=3&t=31391&p=288920&sid=487127adb17ebcbd08c5e7493258153e#p288917)
Hi :tongue:,
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
1378 kCycles for 100 * cmp al, new dest
660 kCycles for 100 * table, new dest
1142 kCycles for 100 * cmp al, in place
960 kCycles for 100 * xlat, new dest
1395 kCycles for 100 * LevelUp, in place
1425 kCycles for 100 * cmp al, new dest
674 kCycles for 100 * table, new dest
1185 kCycles for 100 * cmp al, in place
1006 kCycles for 100 * xlat, new dest
1425 kCycles for 100 * LevelUp, in place
1365 kCycles for 100 * cmp al, new dest
671 kCycles for 100 * table, new dest
1159 kCycles for 100 * cmp al, in place
1012 kCycles for 100 * xlat, new dest
1472 kCycles for 100 * LevelUp, in place
75 bytes for cmp al, new dest
65 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
66 bytes for LevelUp, in place
--- ok ---
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
++++++14 of 20 tests valid, loop overhead is approx. 323/100 cycles
4207 kCycles for 100 * cmp al, new dest
1550 kCycles for 100 * table, new dest
1980 kCycles for 100 * cmp al, in place
1785 kCycles for 100 * xlat, new dest
3248 kCycles for 100 * LevelUp, in place
5186 kCycles for 100 * cmp al, new dest
1568 kCycles for 100 * table, new dest
2188 kCycles for 100 * cmp al, in place
1821 kCycles for 100 * xlat, new dest
3231 kCycles for 100 * LevelUp, in place
5047 kCycles for 100 * cmp al, new dest
1667 kCycles for 100 * table, new dest
1964 kCycles for 100 * cmp al, in place
1777 kCycles for 100 * xlat, new dest
3310 kCycles for 100 * LevelUp, in place
75 bytes for cmp al, new dest
65 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
66 bytes for LevelUp, in place
-
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
+19 of 20 tests valid, loop overhead is approx. 322/100 cycles
3438 kCycles for 100 * cmp al, new dest
1883 kCycles for 100 * table, new dest
3641 kCycles for 100 * cmp al, in place
2602 kCycles for 100 * xlat, new dest
2967 kCycles for 100 * LevelUp, in place
3917 kCycles for 100 * cmp al, new dest
1882 kCycles for 100 * table, new dest
3654 kCycles for 100 * cmp al, in place
2513 kCycles for 100 * xlat, new dest
2975 kCycles for 100 * LevelUp, in place
3839 kCycles for 100 * cmp al, new dest
1879 kCycles for 100 * table, new dest
3644 kCycles for 100 * cmp al, in place
2618 kCycles for 100 * xlat, new dest
2903 kCycles for 100 * LevelUp, in place
75 bytes for cmp al, new dest
65 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
66 bytes for LevelUp, in place
Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz (SSE4)
+++++15 of 20 tests valid, loop overhead is approx. 402/100 cycles
3153 kCycles for 100 * cmp al, new dest
1174 kCycles for 100 * table, new dest
1435 kCycles for 100 * cmp al, in place
1584 kCycles for 100 * xlat, new dest
3355 kCycles for 100 * LevelUp, in place
3770 kCycles for 100 * cmp al, new dest
1191 kCycles for 100 * table, new dest
1510 kCycles for 100 * cmp al, in place
1737 kCycles for 100 * xlat, new dest
4648 kCycles for 100 * LevelUp, in place
3725 kCycles for 100 * cmp al, new dest
1334 kCycles for 100 * table, new dest
1969 kCycles for 100 * cmp al, in place
2227 kCycles for 100 * xlat, new dest
4367 kCycles for 100 * LevelUp, in place
75 bytes for cmp al, new dest
65 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
66 bytes for LevelUp, in place
And the winner is...
xor ecx, ecx
align 4
.Repeat
  movzx eax, byte ptr [edx+ecx]      ; edx is source string
  movzx eax, byte ptr TheTable[eax]  ; mov al is equally fast on i5
  mov [edi+ecx], al                  ; edi is destination string
  inc ecx
.Until ecx>=SrcBytes
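For anyone who prefers to read the idea outside of assembler, here is a rough C sketch of the table loop above. The names `the_table`, `init_table` and `upper_copy` are made up for this illustration and are not part of the testbed. The table is filled with a plain range check, which is an assumption, since the table-building code is not shown in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 256-entry translation table: the_table[c] is the uppercase form of byte c.
   Hypothetical name, standing in for TheTable in the assembler loop. */
static unsigned char the_table[256];

/* Fill the table once: 'a'..'z' map to 'A'..'Z', every other byte to itself. */
static void init_table(void)
{
    for (int i = 0; i < 256; i++)
        the_table[i] = (i >= 'a' && i <= 'z') ? (unsigned char)(i - 32)
                                              : (unsigned char)i;
}

/* Counted copy into a new destination: one load, one table lookup and one
   store per byte, with no data-dependent branch inside the loop. */
static void upper_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = the_table[src[i]];
}
```

Call `init_table()` once, then `upper_copy(dst, src, len)`. Unlike the `.Repeat ... .Until` loop, the C `for` loop also copes with a length of zero.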
I've added two more table-based solutions, both fast and only 28 bytes long:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
++++16 of 20 tests valid, loop overhead is approx. 600/100 cycles
4062 kCycles for 100 * cmp al, new dest
1622 kCycles for 100 * table, new dest
3075 kCycles for 100 * cmp al, in place
1668 kCycles for 100 * xlat, new dest
1638 kCycles for 100 * table, source zero-delimited
1634 kCycles for 100 * table, source zero-delimited, stosb
4057 kCycles for 100 * cmp al, new dest
1617 kCycles for 100 * table, new dest
3058 kCycles for 100 * cmp al, in place
1666 kCycles for 100 * xlat, new dest
1620 kCycles for 100 * table, source zero-delimited
1623 kCycles for 100 * table, source zero-delimited, stosb
4089 kCycles for 100 * cmp al, new dest
1607 kCycles for 100 * table, new dest
3191 kCycles for 100 * cmp al, in place
1702 kCycles for 100 * xlat, new dest
1621 kCycles for 100 * table, source zero-delimited
1626 kCycles for 100 * table, source zero-delimited, stosb
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
3494 kCycles for 100 * cmp al, new dest
1657 kCycles for 100 * table, new dest
2839 kCycles for 100 * cmp al, in place
2547 kCycles for 100 * xlat, new dest
1672 kCycles for 100 * table, source zero-delimited
2977 kCycles for 100 * table, source zero-delimited, stosb
3485 kCycles for 100 * cmp al, new dest
1719 kCycles for 100 * table, new dest
2883 kCycles for 100 * cmp al, in place
2515 kCycles for 100 * xlat, new dest
1635 kCycles for 100 * table, source zero-delimited
2971 kCycles for 100 * table, source zero-delimited, stosb
3440 kCycles for 100 * cmp al, new dest
1681 kCycles for 100 * table, new dest
2804 kCycles for 100 * cmp al, in place
2502 kCycles for 100 * xlat, new dest
1652 kCycles for 100 * table, source zero-delimited
2938 kCycles for 100 * table, source zero-delimited, stosb
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
Intel(R) Core(TM) i5-9400H CPU @ 2.50GHz (SSE4)
2941 kCycles for 100 * cmp al, new dest
999 kCycles for 100 * table, new dest
1898 kCycles for 100 * cmp al, in place
1177 kCycles for 100 * xlat, new dest
1183 kCycles for 100 * table, source zero-delimited
1257 kCycles for 100 * table, source zero-delimited, stosb
2994 kCycles for 100 * cmp al, new dest
890 kCycles for 100 * table, new dest
1902 kCycles for 100 * cmp al, in place
1161 kCycles for 100 * xlat, new dest
1171 kCycles for 100 * table, source zero-delimited
1187 kCycles for 100 * table, source zero-delimited, stosb
2946 kCycles for 100 * cmp al, new dest
909 kCycles for 100 * table, new dest
1923 kCycles for 100 * cmp al, in place
1236 kCycles for 100 * xlat, new dest
1352 kCycles for 100 * table, source zero-delimited
1157 kCycles for 100 * table, source zero-delimited, stosb
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
--- ok ---
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
-19 of 20 tests valid, loop overhead is approx. 78/100 cycles
1397 kCycles for 100 * cmp al, new dest
662 kCycles for 100 * table, new dest
1133 kCycles for 100 * cmp al, in place
966 kCycles for 100 * xlat, new dest
850 kCycles for 100 * table, source zero-delimited
907 kCycles for 100 * table, source zero-delimited, stosb
1348 kCycles for 100 * cmp al, new dest
682 kCycles for 100 * table, new dest
1088 kCycles for 100 * cmp al, in place
924 kCycles for 100 * xlat, new dest
864 kCycles for 100 * table, source zero-delimited
884 kCycles for 100 * table, source zero-delimited, stosb
1389 kCycles for 100 * cmp al, new dest
681 kCycles for 100 * table, new dest
1069 kCycles for 100 * cmp al, in place
936 kCycles for 100 * xlat, new dest
852 kCycles for 100 * table, source zero-delimited
861 kCycles for 100 * table, source zero-delimited, stosb
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
--- ok ---
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
1354 kCycles for 100 * cmp al, new dest
652 kCycles for 100 * table, new dest
1084 kCycles for 100 * cmp al, in place
945 kCycles for 100 * xlat, new dest
843 kCycles for 100 * table, source zero-delimited
879 kCycles for 100 * table, source zero-delimited, stosb
1402 kCycles for 100 * cmp al, new dest
671 kCycles for 100 * table, new dest
1110 kCycles for 100 * cmp al, in place
974 kCycles for 100 * xlat, new dest
916 kCycles for 100 * table, source zero-delimited
939 kCycles for 100 * table, source zero-delimited, stosb
1357 kCycles for 100 * cmp al, new dest
648 kCycles for 100 * table, new dest
1239 kCycles for 100 * cmp al, in place
995 kCycles for 100 * xlat, new dest
871 kCycles for 100 * table, source zero-delimited
871 kCycles for 100 * table, source zero-delimited, stosb
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (SSE4)
4736 kCycles for 100 * cmp al, new dest
1600 kCycles for 100 * table, new dest
2282 kCycles for 100 * cmp al, in place
2394 kCycles for 100 * xlat, new dest
1705 kCycles for 100 * table, source zero-delimited
2012 kCycles for 100 * table, source zero-delimited, stosb
4969 kCycles for 100 * cmp al, new dest
1782 kCycles for 100 * table, new dest
2606 kCycles for 100 * cmp al, in place
2197 kCycles for 100 * xlat, new dest
1927 kCycles for 100 * table, source zero-delimited
2170 kCycles for 100 * table, source zero-delimited, stosb
4848 kCycles for 100 * cmp al, new dest
3238 kCycles for 100 * table, new dest
2276 kCycles for 100 * cmp al, in place
1797 kCycles for 100 * xlat, new dest
1692 kCycles for 100 * table, source zero-delimited
1916 kCycles for 100 * table, source zero-delimited, stosb
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
-
Biterider
I just wonder how many terabytes you would need to see the difference. :tongue:
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
3931 kCycles for 100 * cmp al, new dest
1224 kCycles for 100 * table, new dest
1869 kCycles for 100 * cmp al, in place
2007 kCycles for 100 * xlat, new dest
5581 kCycles for 100 * LevelUp, in place
4133 kCycles for 100 * cmp al, new dest
1225 kCycles for 100 * table, new dest
2107 kCycles for 100 * cmp al, in place
1735 kCycles for 100 * xlat, new dest
5706 kCycles for 100 * LevelUp, in place
4605 kCycles for 100 * cmp al, new dest
1605 kCycles for 100 * table, new dest
1844 kCycles for 100 * cmp al, in place
1555 kCycles for 100 * xlat, new dest
5506 kCycles for 100 * LevelUp, in place
75 bytes for cmp al, new dest
65 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
66 bytes for LevelUp, in place
Pardon me as I intrude here, as an amateur, but let me get this straight: we're talking about ToUpper() which turns any ASCII alpha characters into uppercase? Is that right?
If so, the method being used seems waaay too complex. Here's mine, which I've been using for decades now:
;============================================
; ToUpper()
;
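; If the character in AL is a lowercase letter, converts it to uppercase.
;============================================
ToUpper PROC
CMP AL, 'a'      ; below 'a'? then it is not a lowercase letter
JB tu99
CMP AL, 'z'      ; above 'z'? then it is not a lowercase letter
JA tu99
AND AL, 5FH      ; for 'a'..'z' this only clears bit 5 (20H), giving 'A'..'Z'
tu99: RET
ToUpper ENDP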
Just twiddle a couple bits is all. (ToLower() would operate similarly, by setting rather than clearing bits.)
Any objections? or am I missing something obvious here?
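To make the role of the two compares concrete, here is a small C sketch of the same routine next to a version that applies the mask without any range check. The function names are hypothetical and only serve this illustration.

```c
#include <assert.h>

/* Uppercase one ASCII character, mirroring the CMP/JB/CMP/JA/AND sequence. */
static unsigned char to_upper_char(unsigned char c)
{
    if (c < 'a' || c > 'z')
        return c;                      /* not a lowercase letter: leave it alone */
    return (unsigned char)(c & 0x5F);  /* 'a' (0x61) becomes 'A' (0x41) */
}

/* The mask alone, without the range check, for comparison only. */
static unsigned char mask_only(unsigned char c)
{
    return (unsigned char)(c & 0x5F);
}

/* Small self-check showing why the range test matters. */
static void self_check(void)
{
    assert(to_upper_char('q') == 'Q');
    assert(to_upper_char('{') == '{');
    assert(mask_only('{') == '[');    /* 0x7B & 0x5F = 0x5B */
    assert(mask_only('1') == 0x11);   /* 0x31 & 0x5F = 0x11, no longer a digit */
}
```

So the bit twiddling is fine, but only behind the range check: without it, digits and punctuation get damaged.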
NoCforMe,
Your routine works for a single char, while the algos in the testbed translate a whole string to uppercase. The first one called "cmp al, new dest" does exactly what yours is doing. However, this one (using a table) is over twice as fast, and very short:
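mov edx, offset Src                  ; source string (zero-delimited)
mov edi, offset Dest                 ; destination buffer
xor eax, eax                         ; eax = byte index
align 4
.While 1
  movzx ecx, byte ptr [edx+eax]      ; load the next source byte
  jecxz @out                         ; zero terminator: done
  movzx ecx, byte ptr TheTable[ecx]  ; look up the uppercase byte
  mov [edi+eax], cl                  ; store it in the destination
  inc eax
.Endw
@out:                                ; note: the terminator itself is not copied to Dest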
As Hutch writes above, there is rarely a need for so much speed. We are doing this just for fun :cool:
Hi Jochen,
Thanks, I liked your code creating the lookup table :thumbsup: Better and more elegant than my handcrafted tables.
JJ,
If you have the time, clock these two out of the library.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
szLower proc text:DWORD
; -----------------------------
; converts string to lower case
; invoke szLower,ADDR szString
; -----------------------------
mov eax, [esp+4]
dec eax
@@:
add eax, 1
cmp BYTE PTR [eax], 0
je @F
cmp BYTE PTR [eax], "A"
jb @B
cmp BYTE PTR [eax], "Z"
ja @B
add BYTE PTR [eax], 32
jmp @B
@@:
mov eax, [esp+4]
ret 4
szLower endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
szUpper proc text:DWORD
; -----------------------------
; converts string to upper case
; invoke szUpper,ADDR szString
; -----------------------------
mov eax, [esp+4]
dec eax
@@:
add eax, 1
cmp BYTE PTR [eax], 0
je @F
cmp BYTE PTR [eax], "a"
jb @B
cmp BYTE PTR [eax], "z"
ja @B
sub BYTE PTR [eax], 32
jmp @B
@@:
mov eax, [esp+4]
ret 4
szUpper endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Quote from: Vortex on January 13, 2022, 04:48:33 AM
Hi Jochen,
Thanks, I liked your code creating the lookup table :thumbsup: Better and more elegant than my handcrafted tables.
Hi Erol,
It's called "cheating", but it fits the purpose ;-)
Quote from: hutch-- on January 13, 2022, 04:55:10 AM
JJ,
If you have the time, clock these two out of the library.
I have a suspicion that szUpper profits from the fact that it converts "in place". So in run #2 it's already all uppercase. But why is szLower so slow then? Mysteries :sad:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 631/100 cycles
4066 kCycles for 100 * cmp al, new dest
1606 kCycles for 100 * table, new dest
3050 kCycles for 100 * cmp al, in place
1659 kCycles for 100 * xlat, new dest
1614 kCycles for 100 * table, source zero-delimited
1612 kCycles for 100 * table, source zero-delimited, stosb
6310 kCycles for 100 * szLower
1610 kCycles for 100 * szUpper
2198 kCycles for 100 * cmp al, new dest
1626 kCycles for 100 * table, new dest
3139 kCycles for 100 * cmp al, in place
1657 kCycles for 100 * xlat, new dest
1610 kCycles for 100 * table, source zero-delimited
1631 kCycles for 100 * table, source zero-delimited, stosb
6308 kCycles for 100 * szLower
1615 kCycles for 100 * szUpper
2192 kCycles for 100 * cmp al, new dest
1613 kCycles for 100 * table, new dest
3049 kCycles for 100 * cmp al, in place
1667 kCycles for 100 * xlat, new dest
1613 kCycles for 100 * table, source zero-delimited
1617 kCycles for 100 * table, source zero-delimited, stosb
6359 kCycles for 100 * szLower
1612 kCycles for 100 * szUpper
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
7 bytes for szLower
7 bytes for szUpper
I have no idea why they differ so much.
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
3954 kCycles for 100 * cmp al, new dest
1281 kCycles for 100 * table, new dest
1896 kCycles for 100 * cmp al, in place
1603 kCycles for 100 * xlat, new dest
1490 kCycles for 100 * table, source zero-delimited
1693 kCycles for 100 * table, source zero-delimited, stosb
6283 kCycles for 100 * szLower
1541 kCycles for 100 * szUpper
1989 kCycles for 100 * cmp al, new dest
1304 kCycles for 100 * table, new dest
1897 kCycles for 100 * cmp al, in place
1603 kCycles for 100 * xlat, new dest
1482 kCycles for 100 * table, source zero-delimited
1695 kCycles for 100 * table, source zero-delimited, stosb
6282 kCycles for 100 * szLower
1618 kCycles for 100 * szUpper
1990 kCycles for 100 * cmp al, new dest
1280 kCycles for 100 * table, new dest
1898 kCycles for 100 * cmp al, in place
1608 kCycles for 100 * xlat, new dest
1479 kCycles for 100 * table, source zero-delimited
1693 kCycles for 100 * table, source zero-delimited, stosb
6295 kCycles for 100 * szLower
1618 kCycles for 100 * szUpper
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
7 bytes for szLower
7 bytes for szUpper
Quote from: hutch-- on January 13, 2022, 09:33:43 AM
I have no idea why they differ so much.
It's the "in place" thing:
szLower:
cmp BYTE PTR [eax], 0 ; after the first iteration, the string is all lowercase
je @F
cmp BYTE PTR [eax], "A" ; lowercase letters are not below "A": jb falls through
jb @B
cmp BYTE PTR [eax], "Z" ; lowercase letters are above "Z": ja is taken
ja @B
szUpper:
cmp BYTE PTR [eax], 0 ; after the first iteration, the string is all uppercase
je @F
cmp BYTE PTR [eax], "a" ; uppercase letters are below "a": jb is taken
jb @B
cmp BYTE PTR [eax], "z" ; never reached for uppercase letters
ja @B
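The difference in work per byte can be counted with a little C model of the two compare sequences above. This is only a sketch: the function names are invented, nothing is converted, and it counts range compares only. It says nothing about branch prediction or the cost of the memory operands, so it cannot account for the whole gap in the timings.

```c
#include <assert.h>
#include <stddef.h>

/* Range compares executed in one pass by the szLower pattern:
   "A" first, then "Z". The terminator check is not counted. */
static size_t compares_lower_order(const char *s)
{
    size_t n = 0;
    for (; *s != 0; s++) {
        unsigned char c = (unsigned char)*s;
        n++;                    /* cmp byte, "A" */
        if (c < 'A') continue;  /* jb @B taken */
        n++;                    /* cmp byte, "Z" */
    }
    return n;
}

/* Same for the szUpper pattern: "a" first, then "z". */
static size_t compares_upper_order(const char *s)
{
    size_t n = 0;
    for (; *s != 0; s++) {
        unsigned char c = (unsigned char)*s;
        n++;                    /* cmp byte, "a" */
        if (c < 'a') continue;  /* jb @B taken */
        n++;                    /* cmp byte, "z" */
    }
    return n;
}

/* Already converted input, as seen from the second iteration onwards. */
static void self_check(void)
{
    assert(compares_lower_order("hello") == 10);  /* two compares per letter */
    assert(compares_upper_order("HELLO") == 5);   /* one compare per letter */
}
```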
Now this is weird... a big improvement for szLower, partly because I changed the AZ order, but mostly because of a single instruction change:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
4076 kCycles for 100 * cmp al, new dest
1614 kCycles for 100 * table, new dest
3152 kCycles for 100 * cmp al, in place
1657 kCycles for 100 * xlat, new dest
1613 kCycles for 100 * table, source zero-delimited
1617 kCycles for 100 * table, source zero-delimited, stosb
1827 kCycles for 100 * szLower2
1635 kCycles for 100 * szUpper2
2189 kCycles for 100 * cmp al, new dest
1613 kCycles for 100 * table, new dest
3143 kCycles for 100 * cmp al, in place
1688 kCycles for 100 * xlat, new dest
1623 kCycles for 100 * table, source zero-delimited
1622 kCycles for 100 * table, source zero-delimited, stosb
1806 kCycles for 100 * szLower2
1616 kCycles for 100 * szUpper2
Big improvement for szLower, right?
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
szLower2 proc text:DWORD
; -----------------------------
; converts string to lower case
; invoke szLower2,ADDR szString
; -----------------------------
mov eax, [esp+4]
dec eax
; align 2 ; much, much slower on my i5
@@:
if 1
inc eax ; much, much faster than add eax, 1
else
add eax, 1
endif
cmp BYTE PTR [eax], 0
je @F
cmp BYTE PTR [eax], "Z"
ja @B
cmp BYTE PTR [eax], "A"
jb @B
or BYTE PTR [eax], 32
jmp @B
@@:
mov eax, [esp+4]
ret 4
szLower2 endp
Btw the source assembles without MasmBasic, it's plain Masm32 SDK :cool:
Quote from: jj2007 on January 13, 2022, 11:03:17 AM
Now this is weird... a big improvement for szLower, partly because I changed the AZ order, but mostly because of a single instruction change:
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
1441 kCycles for 100 * cmp al, new dest
705 kCycles for 100 * table, new dest
1147 kCycles for 100 * cmp al, in place
981 kCycles for 100 * xlat, new dest
887 kCycles for 100 * table, source zero-delimited
914 kCycles for 100 * table, source zero-delimited, stosb
1088 kCycles for 100 * szLower2
952 kCycles for 100 * szUpper2
938 kCycles for 100 * cmp al, new dest
706 kCycles for 100 * table, new dest
1142 kCycles for 100 * cmp al, in place
974 kCycles for 100 * xlat, new dest
963 kCycles for 100 * table, source zero-delimited
914 kCycles for 100 * table, source zero-delimited, stosb
1107 kCycles for 100 * szLower2
974 kCycles for 100 * szUpper2
1013 kCycles for 100 * cmp al, new dest
702 kCycles for 100 * table, new dest
1162 kCycles for 100 * cmp al, in place
978 kCycles for 100 * xlat, new dest
893 kCycles for 100 * table, source zero-delimited
956 kCycles for 100 * table, source zero-delimited, stosb
1124 kCycles for 100 * szLower2
953 kCycles for 100 * szUpper2
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
7 bytes for szLower2
7 bytes for szUpper2
--- ok ---
Quote from: nidud on January 13, 2022, 07:23:05 AM
They will probably be similar with a small advantage for the table.
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (AVX512)
----------------------------------------------
-- test(1)
29771 cycles, rep(1000), code( 29) 0.asm: cmp
38149 cycles, rep(1000), code(288) 1.asm: table
28763 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(2)
29992 cycles, rep(1000), code( 29) 0.asm: cmp
40331 cycles, rep(1000), code(288) 1.asm: table
25967 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(3)
29737 cycles, rep(1000), code( 29) 0.asm: cmp
40168 cycles, rep(1000), code(288) 1.asm: table
29103 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(4)
30073 cycles, rep(1000), code( 29) 0.asm: cmp
41148 cycles, rep(1000), code(288) 1.asm: table
28872 cycles, rep(1000), code(304) 2.asm: cmp+table
total [1 .. 4], 1++
112705 cycles 2.asm: cmp+table
119573 cycles 0.asm: cmp
159796 cycles 1.asm: table
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
3243 kCycles for 100 * cmp al, new dest
1348 kCycles for 100 * table, new dest
2541 kCycles for 100 * cmp al, in place
2202 kCycles for 100 * xlat, new dest
1419 kCycles for 100 * table, source zero-delimited
2641 kCycles for 100 * table, source zero-delimited, stosb
1782 kCycles for 100 * szLower2
1015 kCycles for 100 * szUpper2
1963 kCycles for 100 * cmp al, new dest
1362 kCycles for 100 * table, new dest
2541 kCycles for 100 * cmp al, in place
2201 kCycles for 100 * xlat, new dest
1352 kCycles for 100 * table, source zero-delimited
2632 kCycles for 100 * table, source zero-delimited, stosb
1782 kCycles for 100 * szLower2
981 kCycles for 100 * szUpper2
1983 kCycles for 100 * cmp al, new dest
1344 kCycles for 100 * table, new dest
2517 kCycles for 100 * cmp al, in place
2196 kCycles for 100 * xlat, new dest
1407 kCycles for 100 * table, source zero-delimited
2636 kCycles for 100 * table, source zero-delimited, stosb
1745 kCycles for 100 * szLower2
989 kCycles for 100 * szUpper2
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
7 bytes for szLower2
7 bytes for szUpper2
AMD Ryzen 5 3400G with Radeon Vega Graphics (AVX2)
----------------------------------------------
-- test(1)
77637 cycles, rep(1000), code( 29) 0.asm: cmp
58463 cycles, rep(1000), code(288) 1.asm: table
85323 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(2)
90772 cycles, rep(1000), code( 29) 0.asm: cmp
68947 cycles, rep(1000), code(288) 1.asm: table
99731 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(3)
95322 cycles, rep(1000), code( 29) 0.asm: cmp
67971 cycles, rep(1000), code(288) 1.asm: table
95539 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(4)
98111 cycles, rep(1000), code( 29) 0.asm: cmp
73314 cycles, rep(1000), code(288) 1.asm: table
92986 cycles, rep(1000), code(304) 2.asm: cmp+table
total [1 .. 4], 1++
268695 cycles 1.asm: table
361842 cycles 0.asm: cmp
373579 cycles 2.asm: cmp+table
Hi Nidud,
Your lookup table function can be made faster by eliminating the first jump coming after test edx,edx:
include \masm32\include64\masm64rt.inc
.data
table_up label sbyte
i = 0
while i lt 256
if (i ge 'a') and (i le 'z')
db i and not ' '
else
db i
endif
i = i + 1
endm
s db 'This IS a test function.',0
.code
UpperCase PROC string:QWORD
lea r8,table_up
mov rax,rcx
dec rcx
_loop:
inc rcx
movzx edx,byte ptr [rcx]
mov r9b,[r8+rdx]
mov [rcx],r9b
test edx,edx
jnz _loop
ret
UpperCase ENDP
start PROC
invoke UpperCase,ADDR s
invoke StdOut,ADDR s
invoke ExitProcess,0
start ENDP
END
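The shape of that loop (translate and store first, test for the terminator afterwards, so there is only one branch per byte) can be sketched in C as follows. This is an illustration under assumptions, not a translation of the 64-bit code: `build_table_up` and `upper_in_place` are made-up names.

```c
#include <assert.h>
#include <string.h>

static unsigned char table_up[256];

/* Build the table the way the while/endm block does: "i and not ' '"
   clears bit 5 (0x20) for 'a'..'z'; every other byte maps to itself. */
static void build_table_up(void)
{
    for (int i = 0; i < 256; i++)
        table_up[i] = (unsigned char)((i >= 'a' && i <= 'z') ? (i & ~' ') : i);
}

/* In-place conversion with a single branch per byte. The terminator is
   stored too, which is harmless because table_up[0] is 0. */
static char *upper_in_place(char *s)
{
    unsigned char *p = (unsigned char *)s;
    unsigned char c;
    do {
        c = *p;
        *p++ = table_up[c];
    } while (c != 0);
    return s;   /* like the assembler version, return the start of the string */
}

static void self_check(void)
{
    char s[] = "This IS a test function.";
    build_table_up();
    assert(strcmp(upper_in_place(s), "THIS IS A TEST FUNCTION.") == 0);
}
```

The price for dropping the early exit is one extra lookup and store for the terminating zero.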
@all: thanks :thup:
I guess you all realise that szLower2 and szUpper2 look fast because they are in-place algos. That is, after the first iteration, the remaining 99 iterations run on an already converted string. That is much faster because each letter then needs only the terminator check and one range cmp before the loop jumps back.
I did a test to overcome this problem, as follows:
NameH equ szUpper2+szLower
TestH proc
mov ebx, AlgoLoops/2-1 ; loop e.g. 100x
align 4
.Repeat
invoke szLower2, offset Src
invoke szUpper2, offset Src
dec ebx
.Until Sign?
ret
TestH endp
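The effect of alternating the two routines can be checked with a small C model that counts how many bytes each call really changes. The helpers below are plain C stand-ins with invented names, not the library routines, and they count stores rather than cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* In-place stand-ins; each returns the number of bytes it actually changed. */
static size_t to_lower_inplace(char *s)
{
    size_t changed = 0;
    for (; *s != 0; s++) {
        if (*s >= 'A' && *s <= 'Z') {
            *s = (char)(*s | 0x20);
            changed++;
        }
    }
    return changed;
}

static size_t to_upper_inplace(char *s)
{
    size_t changed = 0;
    for (; *s != 0; s++) {
        if (*s >= 'a' && *s <= 'z') {
            *s = (char)(*s & ~0x20);
            changed++;
        }
    }
    return changed;
}

/* Repeat one routine: only the first call converts anything. */
static size_t repeat_same(char *s, int loops)
{
    size_t total = 0;
    for (int i = 0; i < loops; i++)
        total += to_upper_inplace(s);
    return total;
}

/* Alternate the two routines, as TestH does: every call has letters to convert. */
static size_t alternate(char *s, int loops)
{
    size_t total = 0;
    for (int i = 0; i < loops / 2; i++) {
        total += to_lower_inplace(s);
        total += to_upper_inplace(s);
    }
    return total;
}

static void self_check(void)
{
    char a[] = "Hello World";
    char b[] = "Hello World";
    assert(repeat_same(a, 100) == 8);   /* only the first call changes anything */
    assert(alternate(b, 100) == 992);   /* 2 + 10 in the first round, then 49 * 20 */
}
```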
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 395/100 cycles
4988 kCycles for 100 * cmp al, new dest
1970 kCycles for 100 * table, new dest
3732 kCycles for 100 * cmp al, in place
2026 kCycles for 100 * xlat, new dest
1973 kCycles for 100 * table, source zero-delimited
1975 kCycles for 100 * table, source zero-delimited, stosb
2205 kCycles for 100 * szLower2
8608 kCycles for 100 * szUpper2+szLower
2669 kCycles for 100 * cmp al, new dest
1970 kCycles for 100 * table, new dest
3844 kCycles for 100 * cmp al, in place
2029 kCycles for 100 * xlat, new dest
1971 kCycles for 100 * table, source zero-delimited
1974 kCycles for 100 * table, source zero-delimited, stosb
2209 kCycles for 100 * szLower2
8605 kCycles for 100 * szUpper2+szLower
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
3548 kCycles for 100 * cmp al, new dest
1512 kCycles for 100 * table, new dest
2841 kCycles for 100 * cmp al, in place
2469 kCycles for 100 * xlat, new dest
1507 kCycles for 100 * table, source zero-delimited
2846 kCycles for 100 * table, source zero-delimited, stosb
2360 kCycles for 100 * szLower2
1461 kCycles for 100 * szUpper2
2584 kCycles for 100 * cmp al, new dest
1503 kCycles for 100 * table, new dest
2795 kCycles for 100 * cmp al, in place
2459 kCycles for 100 * xlat, new dest
1543 kCycles for 100 * table, source zero-delimited
2903 kCycles for 100 * table, source zero-delimited, stosb
2195 kCycles for 100 * szLower2
1140 kCycles for 100 * szUpper2
2179 kCycles for 100 * cmp al, new dest
1464 kCycles for 100 * table, new dest
2761 kCycles for 100 * cmp al, in place
3423 kCycles for 100 * xlat, new dest
2219 kCycles for 100 * table, source zero-delimited
3214 kCycles for 100 * table, source zero-delimited, stosb
1955 kCycles for 100 * szLower2
1022 kCycles for 100 * szUpper2
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
7 bytes for szLower2
7 bytes for szUpper2
--- ok ---
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
8409 kCycles for 100 * cmp al, new dest
2725 kCycles for 100 * table, new dest
7996 kCycles for 100 * cmp al, in place
2952 kCycles for 100 * xlat, new dest
8216 kCycles for 100 * LevelUp, in place
8272 kCycles for 100 * cmp al, new dest
2679 kCycles for 100 * table, new dest
8048 kCycles for 100 * cmp al, in place
2961 kCycles for 100 * xlat, new dest
8161 kCycles for 100 * LevelUp, in place
8218 kCycles for 100 * cmp al, new dest
2691 kCycles for 100 * table, new dest
8011 kCycles for 100 * cmp al, in place
2957 kCycles for 100 * xlat, new dest
8178 kCycles for 100 * LevelUp, in place
75 bytes for cmp al, new dest
65 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
66 bytes for LevelUp, in place
--- ok ---
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
8418 kCycles for 100 * cmp al, new dest
2556 kCycles for 100 * table, new dest
8051 kCycles for 100 * cmp al, in place
2949 kCycles for 100 * xlat, new dest
2958 kCycles for 100 * table, source zero-delimited
2968 kCycles for 100 * table, source zero-delimited, stosb
7249 kCycles for 100 * szLower
1976 kCycles for 100 * szUpper
5304 kCycles for 100 * cmp al, new dest
2352 kCycles for 100 * table, new dest
8046 kCycles for 100 * cmp al, in place
3097 kCycles for 100 * xlat, new dest
2939 kCycles for 100 * table, source zero-delimited
2949 kCycles for 100 * table, source zero-delimited, stosb
7098 kCycles for 100 * szLower
1975 kCycles for 100 * szUpper
5284 kCycles for 100 * cmp al, new dest
2316 kCycles for 100 * table, new dest
8296 kCycles for 100 * cmp al, in place
3099 kCycles for 100 * xlat, new dest
3115 kCycles for 100 * table, source zero-delimited
2970 kCycles for 100 * table, source zero-delimited, stosb
7079 kCycles for 100 * szLower
1969 kCycles for 100 * szUpper
75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
7 bytes for szLower
7 bytes for szUpper
--- ok ---