Title: ToUpper & ToLower timings
Post by: jj2007 on January 11, 2022, 01:06:27 PM

This collection is almost 8 years old, but apparently it has never been tested in the Lab :cool:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+++17 of 20 tests valid, loop overhead is approx. 644/100 cycles

4074    kCycles for 100 * cmp al, new dest
1623    kCycles for 100 * table, new dest
3063    kCycles for 100 * cmp al, in place
1669    kCycles for 100 * xlat, new dest
5571    kCycles for 100 * LevelUp, in place

4528    kCycles for 100 * cmp al, new dest
1618    kCycles for 100 * table, new dest
3631    kCycles for 100 * cmp al, in place
2802    kCycles for 100 * xlat, new dest
5553    kCycles for 100 * LevelUp, in place

4569    kCycles for 100 * cmp al, new dest
1633    kCycles for 100 * table, new dest
3076    kCycles for 100 * cmp al, in place
1669    kCycles for 100 * xlat, new dest
5569    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place

See here for similar stuff at the FreeBasic forum (https://www.freebasic.net/forum/viewtopic.php?f=3&t=31391&p=288920&sid=487127adb17ebcbd08c5e7493258153e#p288917)

Title: Re: ToUpper & ToLower timings
Post by: LiaoMi on January 11, 2022, 05:33:57 PM

Hi :tongue:,

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

1378    kCycles for 100 * cmp al, new dest
660     kCycles for 100 * table, new dest
1142    kCycles for 100 * cmp al, in place
960     kCycles for 100 * xlat, new dest
1395    kCycles for 100 * LevelUp, in place

1425    kCycles for 100 * cmp al, new dest
674     kCycles for 100 * table, new dest
1185    kCycles for 100 * cmp al, in place
1006    kCycles for 100 * xlat, new dest
1425    kCycles for 100 * LevelUp, in place

1365    kCycles for 100 * cmp al, new dest
671     kCycles for 100 * table, new dest
1159    kCycles for 100 * cmp al, in place
1012    kCycles for 100 * xlat, new dest
1472    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


--- ok ---

Title: Re: ToUpper & ToLower timings
Post by: daydreamer on January 11, 2022, 11:22:03 PM

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
++++++14 of 20 tests valid, loop overhead is approx. 323/100 cycles

4207    kCycles for 100 * cmp al, new dest
1550    kCycles for 100 * table, new dest
1980    kCycles for 100 * cmp al, in place
1785    kCycles for 100 * xlat, new dest
3248    kCycles for 100 * LevelUp, in place

5186    kCycles for 100 * cmp al, new dest
1568    kCycles for 100 * table, new dest
2188    kCycles for 100 * cmp al, in place
1821    kCycles for 100 * xlat, new dest
3231    kCycles for 100 * LevelUp, in place

5047    kCycles for 100 * cmp al, new dest
1667    kCycles for 100 * table, new dest
1964    kCycles for 100 * cmp al, in place
1777    kCycles for 100 * xlat, new dest
3310    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place


-

Title: Re: ToUpper & ToLower timings
Post by: TimoVJL on January 12, 2022, 12:03:47 AM

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
+19 of 20 tests valid, loop overhead is approx. 322/100 cycles

3438    kCycles for 100 * cmp al, new dest
1883    kCycles for 100 * table, new dest
3641    kCycles for 100 * cmp al, in place
2602    kCycles for 100 * xlat, new dest
2967    kCycles for 100 * LevelUp, in place

3917    kCycles for 100 * cmp al, new dest
1882    kCycles for 100 * table, new dest
3654    kCycles for 100 * cmp al, in place
2513    kCycles for 100 * xlat, new dest
2975    kCycles for 100 * LevelUp, in place

3839    kCycles for 100 * cmp al, new dest
1879    kCycles for 100 * table, new dest
3644    kCycles for 100 * cmp al, in place
2618    kCycles for 100 * xlat, new dest
2903    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place

Title: Re: ToUpper & ToLower timings
Post by: coaster on January 12, 2022, 12:35:09 AM

Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz (SSE4)
+++++15 of 20 tests valid, loop overhead is approx. 402/100 cycles

3153    kCycles for 100 * cmp al, new dest
1174    kCycles for 100 * table, new dest
1435    kCycles for 100 * cmp al, in place
1584    kCycles for 100 * xlat, new dest
3355    kCycles for 100 * LevelUp, in place

3770    kCycles for 100 * cmp al, new dest
1191    kCycles for 100 * table, new dest
1510    kCycles for 100 * cmp al, in place
1737    kCycles for 100 * xlat, new dest
4648    kCycles for 100 * LevelUp, in place

3725    kCycles for 100 * cmp al, new dest
1334    kCycles for 100 * table, new dest
1969    kCycles for 100 * cmp al, in place
2227    kCycles for 100 * xlat, new dest
4367    kCycles for 100 * LevelUp, in place

75      bytes for cmp al, new dest
65      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
66      bytes for LevelUp, in place

Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 12, 2022, 12:48:01 AM

And the winner is...

		xor ecx, ecx
		align 4
		.Repeat
			movzx eax, byte ptr [edx+ecx]		; edx is source string
			movzx eax, byte ptr TheTable[eax]	; mov al is equally fast on i5
			mov [edi+ecx], al			; edi is destination string
			inc ecx
		.Until ecx>=SrcBytes
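
For readers who prefer C, the table loop above can be sketched roughly as follows. This is an illustration only, not the testbed code: the names `the_table`, `init_table` and `to_upper_table` are hypothetical, the table is filled at run time (the thread does not show how `TheTable` is built), and only ASCII a-z is mapped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 256-entry table: entry i holds the uppercase form of byte i. */
static unsigned char the_table[256];

/* Fill the table once: a-z map to A-Z, every other byte maps to itself. */
static void init_table(void)
{
    for (int i = 0; i < 256; i++)
        the_table[i] = (unsigned char)((i >= 'a' && i <= 'z') ? i - 32 : i);
}

/* Counted loop, new destination: one load, one table lookup and one store
   per byte, with no data-dependent branch inside the loop.
   Unlike the .Repeat/.Until loop above, n == 0 is handled. */
static void to_upper_table(unsigned char *dest, const unsigned char *src, size_t n)
{
    assert(dest != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dest[i] = the_table[src[i]];
}
```

Call `init_table()` once, then `to_upper_table(dest, src, len)` for each string. The absence of a per-character conditional branch is one plausible reason the table variants time so consistently in the results.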

I've added two more table-based solutions, both fast and only 28 bytes long:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
++++16 of 20 tests valid, loop overhead is approx. 600/100 cycles

4062    kCycles for 100 * cmp al, new dest
1622    kCycles for 100 * table, new dest
3075    kCycles for 100 * cmp al, in place
1668    kCycles for 100 * xlat, new dest
1638    kCycles for 100 * table, source zero-delimited
1634    kCycles for 100 * table, source zero-delimited, stosb

4057    kCycles for 100 * cmp al, new dest
1617    kCycles for 100 * table, new dest
3058    kCycles for 100 * cmp al, in place
1666    kCycles for 100 * xlat, new dest
1620    kCycles for 100 * table, source zero-delimited
1623    kCycles for 100 * table, source zero-delimited, stosb

4089    kCycles for 100 * cmp al, new dest
1607    kCycles for 100 * table, new dest
3191    kCycles for 100 * cmp al, in place
1702    kCycles for 100 * xlat, new dest
1621    kCycles for 100 * table, source zero-delimited
1626    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb

Title: Re: ToUpper & ToLower timings
Post by: TimoVJL on January 12, 2022, 01:36:18 AM

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

3494    kCycles for 100 * cmp al, new dest
1657    kCycles for 100 * table, new dest
2839    kCycles for 100 * cmp al, in place
2547    kCycles for 100 * xlat, new dest
1672    kCycles for 100 * table, source zero-delimited
2977    kCycles for 100 * table, source zero-delimited, stosb

3485    kCycles for 100 * cmp al, new dest
1719    kCycles for 100 * table, new dest
2883    kCycles for 100 * cmp al, in place
2515    kCycles for 100 * xlat, new dest
1635    kCycles for 100 * table, source zero-delimited
2971    kCycles for 100 * table, source zero-delimited, stosb

3440    kCycles for 100 * cmp al, new dest
1681    kCycles for 100 * table, new dest
2804    kCycles for 100 * cmp al, in place
2502    kCycles for 100 * xlat, new dest
1652    kCycles for 100 * table, source zero-delimited
2938    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb

Title: Re: ToUpper & ToLower timings
Post by: six_L on January 12, 2022, 02:36:47 AM

Intel(R) Core(TM) i5-9400H CPU @ 2.50GHz (SSE4)

2941    kCycles for 100 * cmp al, new dest
999     kCycles for 100 * table, new dest
1898    kCycles for 100 * cmp al, in place
1177    kCycles for 100 * xlat, new dest
1183    kCycles for 100 * table, source zero-delimited
1257    kCycles for 100 * table, source zero-delimited, stosb

2994    kCycles for 100 * cmp al, new dest
890     kCycles for 100 * table, new dest
1902    kCycles for 100 * cmp al, in place
1161    kCycles for 100 * xlat, new dest
1171    kCycles for 100 * table, source zero-delimited
1187    kCycles for 100 * table, source zero-delimited, stosb

2946    kCycles for 100 * cmp al, new dest
909     kCycles for 100 * table, new dest
1923    kCycles for 100 * cmp al, in place
1236    kCycles for 100 * xlat, new dest
1352    kCycles for 100 * table, source zero-delimited
1157    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb


--- ok ---

Title: Re: ToUpper & ToLower timings
Post by: LiaoMi on January 12, 2022, 02:37:28 AM

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
-19 of 20 tests valid, loop overhead is approx. 78/100 cycles

1397    kCycles for 100 * cmp al, new dest
662     kCycles for 100 * table, new dest
1133    kCycles for 100 * cmp al, in place
966     kCycles for 100 * xlat, new dest
850     kCycles for 100 * table, source zero-delimited
907     kCycles for 100 * table, source zero-delimited, stosb

1348    kCycles for 100 * cmp al, new dest
682     kCycles for 100 * table, new dest
1088    kCycles for 100 * cmp al, in place
924     kCycles for 100 * xlat, new dest
864     kCycles for 100 * table, source zero-delimited
884     kCycles for 100 * table, source zero-delimited, stosb

1389    kCycles for 100 * cmp al, new dest
681     kCycles for 100 * table, new dest
1069    kCycles for 100 * cmp al, in place
936     kCycles for 100 * xlat, new dest
852     kCycles for 100 * table, source zero-delimited
861     kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb


--- ok ---

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

1354    kCycles for 100 * cmp al, new dest
652     kCycles for 100 * table, new dest
1084    kCycles for 100 * cmp al, in place
945     kCycles for 100 * xlat, new dest
843     kCycles for 100 * table, source zero-delimited
879     kCycles for 100 * table, source zero-delimited, stosb

1402    kCycles for 100 * cmp al, new dest
671     kCycles for 100 * table, new dest
1110    kCycles for 100 * cmp al, in place
974     kCycles for 100 * xlat, new dest
916     kCycles for 100 * table, source zero-delimited
939     kCycles for 100 * table, source zero-delimited, stosb

1357    kCycles for 100 * cmp al, new dest
648     kCycles for 100 * table, new dest
1239    kCycles for 100 * cmp al, in place
995     kCycles for 100 * xlat, new dest
871     kCycles for 100 * table, source zero-delimited
871     kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb

Title: Re: ToUpper & ToLower timings
Post by: Biterider on January 12, 2022, 07:54:16 AM

Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (SSE4)

4736    kCycles for 100 * cmp al, new dest
1600    kCycles for 100 * table, new dest
2282    kCycles for 100 * cmp al, in place
2394    kCycles for 100 * xlat, new dest
1705    kCycles for 100 * table, source zero-delimited
2012    kCycles for 100 * table, source zero-delimited, stosb

4969    kCycles for 100 * cmp al, new dest
1782    kCycles for 100 * table, new dest
2606    kCycles for 100 * cmp al, in place
2197    kCycles for 100 * xlat, new dest
1927    kCycles for 100 * table, source zero-delimited
2170    kCycles for 100 * table, source zero-delimited, stosb

4848    kCycles for 100 * cmp al, new dest
3238    kCycles for 100 * table, new dest
2276    kCycles for 100 * cmp al, in place
1797    kCycles for 100 * xlat, new dest
1692    kCycles for 100 * table, source zero-delimited
1916    kCycles for 100 * table, source zero-delimited, stosb

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb


-

Biterider

Title: Re: ToUpper & ToLower timings
Post by: hutch-- on January 12, 2022, 11:04:35 AM

I just wonder how many terabytes you would need to see the difference. :tongue:

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

3931 kCycles for 100 * cmp al, new dest
1224 kCycles for 100 * table, new dest
1869 kCycles for 100 * cmp al, in place
2007 kCycles for 100 * xlat, new dest
5581 kCycles for 100 * LevelUp, in place

4133 kCycles for 100 * cmp al, new dest
1225 kCycles for 100 * table, new dest
2107 kCycles for 100 * cmp al, in place
1735 kCycles for 100 * xlat, new dest
5706 kCycles for 100 * LevelUp, in place

4605 kCycles for 100 * cmp al, new dest
1605 kCycles for 100 * table, new dest
1844 kCycles for 100 * cmp al, in place
1555 kCycles for 100 * xlat, new dest
5506 kCycles for 100 * LevelUp, in place

75 bytes for cmp al, new dest
65 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
66 bytes for LevelUp, in place

Title: Re: ToUpper & ToLower timings
Post by: NoCforMe on January 12, 2022, 11:09:51 AM

Pardon me as I intrude here, as an amateur, but let me get this straight: we're talking about ToUpper() which turns any ASCII alpha characters into uppercase? Is that right?

If so, the method being used seems waaay too complex. Here's mine, which I've been using for decades now:

;============================================
; ToUpper()
;
; If character in AL is alphabetic, uppercases it.
;============================================

ToUpper		PROC
	CMP	AL, 'a'
	JB	tu99
	CMP	AL, 'z'
	JA	tu99
	AND	AL, 5FH
tu99:	RET

ToUpper		ENDP

Just twiddle a couple bits is all. (ToLower() would operate similarly, by setting rather than clearing bits.)
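
As a rough C rendering of the bit trick described here (a sketch, not code from the thread; the function names are hypothetical): inside the range a-z only bit 5 (20h) differs from the uppercase letter, so clearing it gives uppercase and setting it gives lowercase.

```c
#include <assert.h>
#include <stddef.h>

/* Uppercase one ASCII character: the range check is the same as in the
   ToUpper PROC, and 0x5F is the same mask as AND AL, 5FH. */
static unsigned char to_upper_char(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        c &= 0x5F;          /* clears bit 5; bit 7 is already 0 for a-z */
    return c;
}

/* Lowercase counterpart: set bit 5 instead of clearing it. */
static unsigned char to_lower_char(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        c |= 0x20;
    return c;
}

/* Hypothetical whole-string wrapper, converting in place. */
static void to_upper_str(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = (char)to_upper_char((unsigned char)*s);
}
```

The range check is still needed: without it, clearing bit 5 would also change characters such as `{` or digits.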

Any objections? or am I missing something obvious here?

Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 12, 2022, 11:30:39 AM

NoCforMe,

Your routine works for a single char, while the algos in the testbed translate a whole string to uppercase. The first one called "cmp al, new dest" does exactly what yours is doing. However, this one (using a table) is over twice as fast, and very short:

	mov edx, offset Src
	mov edi, offset Dest
	xor eax, eax
	align 4
	.While 1
		movzx ecx, byte ptr [edx+eax]
		jecxz @out
		movzx ecx, byte ptr TheTable[ecx]
		mov [edi+eax], cl
		inc eax
	.Endw
	@out:

As Hutch writes above, there is rarely a need for so much speed. We are doing this just for fun :cool:

Title: Re: ToUpper & ToLower timings
Post by: Vortex on January 13, 2022, 04:48:33 AM

Hi Jochen,

Thanks, I liked your code creating the lookup table :thumbsup: Better and more elegant than my handcrafted tables.

Title: Re: ToUpper & ToLower timings
Post by: hutch-- on January 13, 2022, 04:55:10 AM

JJ,

If you have the time, clock these two out of the library.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLower proc text:DWORD

  ; -----------------------------
  ; converts string to lower case
  ; invoke szLower,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax

  @@:
    add eax, 1
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "A"
    jb @B
    cmp BYTE PTR [eax], "Z"
    ja @B
    add BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szLower endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szUpper proc text:DWORD

  ; -----------------------------
  ; converts string to upper case
  ; invoke szUpper,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax

  @@:
    add eax, 1
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "a"
    jb @B
    cmp BYTE PTR [eax], "z"
    ja @B
    sub BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szUpper endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 13, 2022, 06:24:14 AM

Quote from: Vortex on January 13, 2022, 04:48:33 AM
Hi Jochen,

Thanks, I liked your code creating the lookup table :thumbsup: Better and more elegant than my handcrafted tables.

Hi Erol,

It's called "cheating", but it fits the purpose ;-)

Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 13, 2022, 06:37:28 AM

Quote from: hutch-- on January 13, 2022, 04:55:10 AM
JJ,

If you have the time, clock these two out of the library.

I have a suspicion that szUpper profits from the fact that it converts "in place". So in run #2 it's already all uppercase. But why is szLower so slow then? Mysteries :sad:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 631/100 cycles

4066    kCycles for 100 * cmp al, new dest
1606    kCycles for 100 * table, new dest
3050    kCycles for 100 * cmp al, in place
1659    kCycles for 100 * xlat, new dest
1614    kCycles for 100 * table, source zero-delimited
1612    kCycles for 100 * table, source zero-delimited, stosb
6310    kCycles for 100 * szLower
1610    kCycles for 100 * szUpper

2198    kCycles for 100 * cmp al, new dest
1626    kCycles for 100 * table, new dest
3139    kCycles for 100 * cmp al, in place
1657    kCycles for 100 * xlat, new dest
1610    kCycles for 100 * table, source zero-delimited
1631    kCycles for 100 * table, source zero-delimited, stosb
6308    kCycles for 100 * szLower
1615    kCycles for 100 * szUpper

2192    kCycles for 100 * cmp al, new dest
1613    kCycles for 100 * table, new dest
3049    kCycles for 100 * cmp al, in place
1667    kCycles for 100 * xlat, new dest
1613    kCycles for 100 * table, source zero-delimited
1617    kCycles for 100 * table, source zero-delimited, stosb
6359    kCycles for 100 * szLower
1612    kCycles for 100 * szUpper

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower
7       bytes for szUpper

Title: Re: ToUpper & ToLower timings
Post by: nidud on January 13, 2022, 07:23:05 AM

deleted

Title: Re: ToUpper & ToLower timings
Post by: hutch-- on January 13, 2022, 09:33:43 AM

I have no idea why they differ so much.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

3954 kCycles for 100 * cmp al, new dest
1281 kCycles for 100 * table, new dest
1896 kCycles for 100 * cmp al, in place
1603 kCycles for 100 * xlat, new dest
1490 kCycles for 100 * table, source zero-delimited
1693 kCycles for 100 * table, source zero-delimited, stosb
6283 kCycles for 100 * szLower
1541 kCycles for 100 * szUpper

1989 kCycles for 100 * cmp al, new dest
1304 kCycles for 100 * table, new dest
1897 kCycles for 100 * cmp al, in place
1603 kCycles for 100 * xlat, new dest
1482 kCycles for 100 * table, source zero-delimited
1695 kCycles for 100 * table, source zero-delimited, stosb
6282 kCycles for 100 * szLower
1618 kCycles for 100 * szUpper

1990 kCycles for 100 * cmp al, new dest
1280 kCycles for 100 * table, new dest
1898 kCycles for 100 * cmp al, in place
1608 kCycles for 100 * xlat, new dest
1479 kCycles for 100 * table, source zero-delimited
1693 kCycles for 100 * table, source zero-delimited, stosb
6295 kCycles for 100 * szLower
1618 kCycles for 100 * szUpper

75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
7 bytes for szLower
7 bytes for szUpper

Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 13, 2022, 11:03:17 AM

Quote from: hutch-- on January 13, 2022, 09:33:43 AM
I have no idea why they differ so much.

It's the "in place" thing:
szLower:
cmp BYTE PTR [eax], 0 ; after the first iteration, it's a lowerstring
je @F
cmp BYTE PTR [eax], "A" ; nope
jb @B
cmp BYTE PTR [eax], "Z" ; this branch will be taken
ja @B

szUpper:
cmp BYTE PTR [eax], 0 ; after the first iteration, it's an upperstring
je @F
cmp BYTE PTR [eax], "a" ; this branch will be taken
jb @B
cmp BYTE PTR [eax], "z"
ja @B

Now this is weird... a big improvement for szLower, partly because I changed the AZ order, but mostly because of a single instruction change:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

4076    kCycles for 100 * cmp al, new dest
1614    kCycles for 100 * table, new dest
3152    kCycles for 100 * cmp al, in place
1657    kCycles for 100 * xlat, new dest
1613    kCycles for 100 * table, source zero-delimited
1617    kCycles for 100 * table, source zero-delimited, stosb
1827    kCycles for 100 * szLower2
1635    kCycles for 100 * szUpper2

2189    kCycles for 100 * cmp al, new dest
1613    kCycles for 100 * table, new dest
3143    kCycles for 100 * cmp al, in place
1688    kCycles for 100 * xlat, new dest
1623    kCycles for 100 * table, source zero-delimited
1622    kCycles for 100 * table, source zero-delimited, stosb
1806    kCycles for 100 * szLower2
1616    kCycles for 100 * szUpper2

Big improvement for szLower, right?

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLower2 proc text:DWORD

  ; -----------------------------
  ; converts string to lower case
  ; invoke szLower,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax
; align 2	; much, much slower on my i5
  @@:
  if 1
    inc eax	; much, much faster than add eax, 1
  else
    add eax, 1
  endif 
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "Z"
    ja @B
    cmp BYTE PTR [eax], "A"
    jb @B
    or BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szLower2 endp

Btw the source assembles without MasmBasic, it's plain Masm32 SDK :cool:

Title: Re: ToUpper & ToLower timings
Post by: LiaoMi on January 14, 2022, 05:13:03 AM

Quote from: jj2007 on January 13, 2022, 11:03:17 AM
Quote from: hutch-- on January 13, 2022, 09:33:43 AM
I have no idea why they differ so much.

It's the "in place" thing:
szLower:
cmp BYTE PTR [eax], 0 ; after the first iteration, it's a lowerstring
je @F
cmp BYTE PTR [eax], "A" ; nope
jb @B
cmp BYTE PTR [eax], "Z" ; this branch will be taken
ja @B

szUpper:
cmp BYTE PTR [eax], 0 ; after the first iteration, it's an upperstring
je @F
cmp BYTE PTR [eax], "a" ; this branch will be taken
jb @B
cmp BYTE PTR [eax], "z"
ja @B

Now this is weird... a big improvement for szLower, partly because I changed the AZ order, but mostly because of a single instruction change:

Big improvement for szLower, right?
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLower2 proc text:DWORD

  ; -----------------------------
  ; converts string to lower case
  ; invoke szLower,ADDR szString
  ; -----------------------------

    mov eax, [esp+4]
    dec eax
; align 2	; much, much slower on my i5
  @@:
  if 1
    inc eax	; much, much faster than add eax, 1
  else
    add eax, 1
  endif
    cmp BYTE PTR [eax], 0
    je @F
    cmp BYTE PTR [eax], "Z"
    ja @B
    cmp BYTE PTR [eax], "A"
    jb @B
    or BYTE PTR [eax], 32
    jmp @B
  @@:

    mov eax, [esp+4]

    ret 4

szLower2 endp

Btw the source assembles without MasmBasic, it's plain Masm32 SDK :cool:

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

1441    kCycles for 100 * cmp al, new dest
705     kCycles for 100 * table, new dest
1147    kCycles for 100 * cmp al, in place
981     kCycles for 100 * xlat, new dest
887     kCycles for 100 * table, source zero-delimited
914     kCycles for 100 * table, source zero-delimited, stosb
1088    kCycles for 100 * szLower2
952     kCycles for 100 * szUpper2

938     kCycles for 100 * cmp al, new dest
706     kCycles for 100 * table, new dest
1142    kCycles for 100 * cmp al, in place
974     kCycles for 100 * xlat, new dest
963     kCycles for 100 * table, source zero-delimited
914     kCycles for 100 * table, source zero-delimited, stosb
1107    kCycles for 100 * szLower2
974     kCycles for 100 * szUpper2

1013    kCycles for 100 * cmp al, new dest
702     kCycles for 100 * table, new dest
1162    kCycles for 100 * cmp al, in place
978     kCycles for 100 * xlat, new dest
893     kCycles for 100 * table, source zero-delimited
956     kCycles for 100 * table, source zero-delimited, stosb
1124    kCycles for 100 * szLower2
953     kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2


--- ok ---

Title: Re: ToUpper & ToLower timings
Post by: LiaoMi on January 14, 2022, 05:14:40 AM

Quote from: nidud on January 13, 2022, 07:23:05 AM
They will probably be similar with a small advantage for the table.

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (AVX512)
----------------------------------------------
-- test(1)
    29771 cycles, rep(1000), code( 29) 0.asm: cmp
    38149 cycles, rep(1000), code(288) 1.asm: table
    28763 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(2)
    29992 cycles, rep(1000), code( 29) 0.asm: cmp
    40331 cycles, rep(1000), code(288) 1.asm: table
    25967 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(3)
    29737 cycles, rep(1000), code( 29) 0.asm: cmp
    40168 cycles, rep(1000), code(288) 1.asm: table
    29103 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(4)
    30073 cycles, rep(1000), code( 29) 0.asm: cmp
    41148 cycles, rep(1000), code(288) 1.asm: table
    28872 cycles, rep(1000), code(304) 2.asm: cmp+table

total [1 .. 4], 1++
   112705 cycles 2.asm: cmp+table
   119573 cycles 0.asm: cmp
   159796 cycles 1.asm: table
hit any key to continue...

Title: Re: ToUpper & ToLower timings
Post by: TimoVJL on January 14, 2022, 05:20:11 AM

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

3243    kCycles for 100 * cmp al, new dest
1348    kCycles for 100 * table, new dest
2541    kCycles for 100 * cmp al, in place
2202    kCycles for 100 * xlat, new dest
1419    kCycles for 100 * table, source zero-delimited
2641    kCycles for 100 * table, source zero-delimited, stosb
1782    kCycles for 100 * szLower2
1015    kCycles for 100 * szUpper2

1963    kCycles for 100 * cmp al, new dest
1362    kCycles for 100 * table, new dest
2541    kCycles for 100 * cmp al, in place
2201    kCycles for 100 * xlat, new dest
1352    kCycles for 100 * table, source zero-delimited
2632    kCycles for 100 * table, source zero-delimited, stosb
1782    kCycles for 100 * szLower2
981     kCycles for 100 * szUpper2

1983    kCycles for 100 * cmp al, new dest
1344    kCycles for 100 * table, new dest
2517    kCycles for 100 * cmp al, in place
2196    kCycles for 100 * xlat, new dest
1407    kCycles for 100 * table, source zero-delimited
2636    kCycles for 100 * table, source zero-delimited, stosb
1745    kCycles for 100 * szLower2
989     kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2

AMD Ryzen 5 3400G with Radeon Vega Graphics     (AVX2)
----------------------------------------------
-- test(1)
    77637 cycles, rep(1000), code( 29) 0.asm: cmp
    58463 cycles, rep(1000), code(288) 1.asm: table
    85323 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(2)
    90772 cycles, rep(1000), code( 29) 0.asm: cmp
    68947 cycles, rep(1000), code(288) 1.asm: table
    99731 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(3)
    95322 cycles, rep(1000), code( 29) 0.asm: cmp
    67971 cycles, rep(1000), code(288) 1.asm: table
    95539 cycles, rep(1000), code(304) 2.asm: cmp+table
-- test(4)
    98111 cycles, rep(1000), code( 29) 0.asm: cmp
    73314 cycles, rep(1000), code(288) 1.asm: table
    92986 cycles, rep(1000), code(304) 2.asm: cmp+table

total [1 .. 4], 1++
   268695 cycles 1.asm: table
   361842 cycles 0.asm: cmp
   373579 cycles 2.asm: cmp+table

Title: Re: ToUpper & ToLower timings
Post by: Vortex on January 14, 2022, 06:03:05 AM

Hi Nidud,

Your lookup table function can be made faster by eliminating the first jump coming after test edx,edx:

include \masm32\include64\masm64rt.inc

    .data

table_up label sbyte

i = 0

while i lt 256

    if (i ge 'a') and (i le 'z')
        db i and not ' '
    else
        db i
    endif
    i = i + 1

    endm

s db 'This IS a test function.',0

.code

UpperCase PROC string:QWORD

    lea     r8,table_up
    mov     rax,rcx
    dec     rcx

_loop:

    inc     rcx
    movzx   edx,byte ptr [rcx]

    mov     r9b,[r8+rdx]
    mov     [rcx],r9b
    test    edx,edx
    jnz     _loop
    ret

UpperCase ENDP

start PROC

    invoke  UpperCase,ADDR s
    invoke  StdOut,ADDR s
    invoke  ExitProcess,0

start ENDP

END
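
The idea in this listing can be sketched in C as below. It is a loose translation under assumptions, not Vortex's code: `build_upper_table` and `upper_in_place` are hypothetical names, and the table is filled at run time instead of at assembly time. The point is that the terminating zero is also sent through the table (entry 0 maps to 0), so the loop needs only one conditional jump, at the bottom.

```c
#include <assert.h>
#include <stddef.h>

/* Fill a 256-entry table: a-z map to A-Z, every other byte maps to itself.
   "i & ~0x20" mirrors the assembly-time expression "i and not ' '". */
static void build_upper_table(unsigned char table[256])
{
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)((i >= 'a' && i <= 'z') ? (i & ~0x20) : i);
}

/* In-place conversion of a zero-terminated string with a single branch per
   character: the byte is translated and stored first, then tested for zero.
   The terminator is rewritten as itself, so the buffer must be writable. */
static void upper_in_place(unsigned char *s, const unsigned char table[256])
{
    unsigned char c;
    assert(s != NULL && table != NULL);
    do {
        c = *s;
        *s++ = table[c];
    } while (c != 0);
}
```

The price of the missing early exit is one redundant store of the terminator per call, which is negligible for strings of any real length.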

Title: Re: ToUpper & ToLower timings
Post by: jj2007 on January 14, 2022, 06:29:43 AM

@all: thanks :thup:

I guess you all realise that szLower2 and szUpper2 look fast because they are in-place algos. That is, after the first of the 100 calls, the remaining 99 calls are performed on an already converted string. And that is way faster because, after the zero check, each character needs only the first range compare, and its branch is taken immediately.

I did a test to overcome this problem, as follows:

NameH equ szUpper2+szLower
TestH proc
  mov ebx, AlgoLoops/2-1	; AlgoLoops/2 iterations of two calls each, e.g. 100 calls in total
  align 4
  .Repeat
	invoke szLower2, offset Src
	invoke szUpper2, offset Src
	dec ebx
  .Until Sign?
  ret
TestH endp

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 395/100 cycles

4988    kCycles for 100 * cmp al, new dest
1970    kCycles for 100 * table, new dest
3732    kCycles for 100 * cmp al, in place
2026    kCycles for 100 * xlat, new dest
1973    kCycles for 100 * table, source zero-delimited
1975    kCycles for 100 * table, source zero-delimited, stosb
2205    kCycles for 100 * szLower2
8608    kCycles for 100 * szUpper2+szLower

2669    kCycles for 100 * cmp al, new dest
1970    kCycles for 100 * table, new dest
3844    kCycles for 100 * cmp al, in place
2029    kCycles for 100 * xlat, new dest
1971    kCycles for 100 * table, source zero-delimited
1974    kCycles for 100 * table, source zero-delimited, stosb
2209    kCycles for 100 * szLower2
8605    kCycles for 100 * szUpper2+szLower
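
To make the in-place effect concrete, here is a small C sketch (not part of the testbed; all names are hypothetical). The first function counts how many bytes a call really modifies, which drops to zero from the second call on. The second function shows a different way around the problem than the alternating szLower2/szUpper2 loop above: it restores the input before every call, at the price of adding the copy to the measured work.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* In-place uppercase; returns how many bytes were actually modified. */
static size_t upper_in_place_count(char *s)
{
    size_t changed = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s >= 'a' && *s <= 'z') {
            *s = (char)(*s - 32);
            changed++;
        }
    }
    return changed;
}

/* Hypothetical "fair" loop: copy the original text into a scratch buffer
   before each call, so every one of the `loops` calls converts the same
   mixed-case input. Returns the total number of bytes modified. */
static size_t run_with_restore(const char *src, char *scratch, size_t cap, int loops)
{
    size_t total = 0;
    assert(src != NULL && scratch != NULL && strlen(src) < cap);
    for (int i = 0; i < loops; i++) {
        strcpy(scratch, src);
        total += upper_in_place_count(scratch);
    }
    return total;
}
```

Calling `upper_in_place_count` twice on the same buffer shows the effect directly: the second call finds nothing left to convert, so a timing loop of 100 such calls mostly measures the already-converted case.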

Title: Re: ToUpper & ToLower timings
Post by: guga on January 14, 2022, 04:49:58 PM

AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

3548    kCycles for 100 * cmp al, new dest
1512    kCycles for 100 * table, new dest
2841    kCycles for 100 * cmp al, in place
2469    kCycles for 100 * xlat, new dest
1507    kCycles for 100 * table, source zero-delimited
2846    kCycles for 100 * table, source zero-delimited, stosb
2360    kCycles for 100 * szLower2
1461    kCycles for 100 * szUpper2

2584    kCycles for 100 * cmp al, new dest
1503    kCycles for 100 * table, new dest
2795    kCycles for 100 * cmp al, in place
2459    kCycles for 100 * xlat, new dest
1543    kCycles for 100 * table, source zero-delimited
2903    kCycles for 100 * table, source zero-delimited, stosb
2195    kCycles for 100 * szLower2
1140    kCycles for 100 * szUpper2

2179    kCycles for 100 * cmp al, new dest
1464    kCycles for 100 * table, new dest
2761    kCycles for 100 * cmp al, in place
3423    kCycles for 100 * xlat, new dest
2219    kCycles for 100 * table, source zero-delimited
3214    kCycles for 100 * table, source zero-delimited, stosb
1955    kCycles for 100 * szLower2
1022    kCycles for 100 * szUpper2

75      bytes for cmp al, new dest
37      bytes for table, new dest
54      bytes for cmp al, in place
35      bytes for xlat, new dest
28      bytes for table, source zero-delimited
28      bytes for table, source zero-delimited, stosb
7       bytes for szLower2
7       bytes for szUpper2


--- ok ---

Title: Re: ToUpper & ToLower timings
Post by: quarantined on January 16, 2022, 08:00:23 AM

Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)

8409 kCycles for 100 * cmp al, new dest
2725 kCycles for 100 * table, new dest
7996 kCycles for 100 * cmp al, in place
2952 kCycles for 100 * xlat, new dest
8216 kCycles for 100 * LevelUp, in place

8272 kCycles for 100 * cmp al, new dest
2679 kCycles for 100 * table, new dest
8048 kCycles for 100 * cmp al, in place
2961 kCycles for 100 * xlat, new dest
8161 kCycles for 100 * LevelUp, in place

8218 kCycles for 100 * cmp al, new dest
2691 kCycles for 100 * table, new dest
8011 kCycles for 100 * cmp al, in place
2957 kCycles for 100 * xlat, new dest
8178 kCycles for 100 * LevelUp, in place

75 bytes for cmp al, new dest
65 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
66 bytes for LevelUp, in place

--- ok ---

Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)

8418 kCycles for 100 * cmp al, new dest
2556 kCycles for 100 * table, new dest
8051 kCycles for 100 * cmp al, in place
2949 kCycles for 100 * xlat, new dest
2958 kCycles for 100 * table, source zero-delimited
2968 kCycles for 100 * table, source zero-delimited, stosb
7249 kCycles for 100 * szLower
1976 kCycles for 100 * szUpper

5304 kCycles for 100 * cmp al, new dest
2352 kCycles for 100 * table, new dest
8046 kCycles for 100 * cmp al, in place
3097 kCycles for 100 * xlat, new dest
2939 kCycles for 100 * table, source zero-delimited
2949 kCycles for 100 * table, source zero-delimited, stosb
7098 kCycles for 100 * szLower
1975 kCycles for 100 * szUpper

5284 kCycles for 100 * cmp al, new dest
2316 kCycles for 100 * table, new dest
8296 kCycles for 100 * cmp al, in place
3099 kCycles for 100 * xlat, new dest
3115 kCycles for 100 * table, source zero-delimited
2970 kCycles for 100 * table, source zero-delimited, stosb
7079 kCycles for 100 * szLower
1969 kCycles for 100 * szUpper

75 bytes for cmp al, new dest
37 bytes for table, new dest
54 bytes for cmp al, in place
35 bytes for xlat, new dest
28 bytes for table, source zero-delimited
28 bytes for table, source zero-delimited, stosb
7 bytes for szLower
7 bytes for szUpper

--- ok ---

The MASM Forum

General => The Laboratory => Topic started by: jj2007 on January 11, 2022, 01:06:27 PM