I'm still working on my search/find routine. I want to do a byte by byte comparison. If I can't do ".if byte ptr [eax]==byte ptr [edi]", what is the best alternative? I already have a workaround by just moving [edi] to dl. Is there an easy way to eliminate the step? I'd also like to keep the edx register open for other things. How does cmpsb do it?
Quote from: Ryan on June 08, 2012, 10:43:10 PM
I'm still working on my search/find routine. I want to do a byte by byte comparison. If I can't do ".if byte ptr [eax]==byte ptr [edi]", what is the best alternative? I already have a workaround by just moving [edi] to dl. Is there an easy way to eliminate the step? I'd also like to keep the edx register open for other things. How does cmpsb do it?
mov al,[esi]
.if al == [edi]
.endif
cmpsb does mainly the same thing but it is slow...
You need to explain more / better what exactly do you want to compare?
Two strings? one substring inside a string? A char inside a string?
If cmpsb moves [esi] to al, then I guess there's no way around it. No big deal. I was just curious. Thanks.
No cmpsb doesn't use al, but it needs both esi and edi, of course. As regards "slow": That may depend on your CPU. Test yourself 8)
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
235 cycles for cmpsb
518 cycles for cmp [esi+ecx]
313 cycles for cmp [esi]
235 cycles for cmpsb
517 cycles for cmp [esi+ecx]
314 cycles for cmp [esi]
I'm not the kind of counting clocks and wasnt aware it was slow. I've done some compare routines more or less on these lines:
Compare PROC lpSource:DWORD,cbSource:DWORD,lpCompare:DWORD
mov esi,lpSource
mov edi,lpCompare
mov ecx,cbSource ; the size of the source string
compare:
cmpsb
jnz no_match
dec ecx
jcxz match
jmp compare
match:
mov eax,TRUE
ret
no_match:
mov eax,FALSE
ret
Ryan,
Bogdan is right here, the old string instructions are generally slow but they also have the irritation of being locked into specific registers which may not fit into the rest of the code you need to write. The mechanics of doing string comparisons vary on what you are doing, for a normal search you load the start character in one register then compare it to a memory operand pointed to by BYTE PTR.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.data?
value dd ?
.data
txt db "one two three four five six seven", 0
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
mov eax, "f" ; the start character to find
mov edx, OFFSET txt ; the address of the text to find it in
sub edx, 1
lbl0:
add edx, 1
cmp BYTE PTR [edx], 0 ; is it the zero terminator ?
je outa_here
cmp BYTE PTR [edx], al ; compare against the BYTE sized part of EAX
jne lbl0
; in a search algo you would branch here to compare the
; rest of the search word to the current text location
; then return to lbl0 if it does not match
outa_here:
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
REPZ CMPSB might not be as bad as you think :P
we should probably do some comparisons
at any rate, have a look at Hutch's szCmp and szCmpi routines in the \masm32\m32lib folder
See above, Reply #3 :biggrin:
yah - that is what i might expect
if we were doing REPZ CMPSD on aligned DWORD's, things might be different
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
100 cmpsb
100 [esi+ecx]
100 [esi]
500 cycles for cmpsb
448 cycles for cmp [esi+ecx]
544 cycles for cmp [esi]
496 cycles for cmpsb
448 cycles for cmp [esi+ecx]
541 cycles for cmp [esi]
string length may play an important role
although, we rarely compare strings longer than, say, 100 bytes or so
If cmpsb doesn't use al, how does it do the comparison? I've tried "cmp byte ptr [eax], byte ptr [edx]", but it doesn't work. I assume it must be using some other method to do the comparison.
Ryan,
That will not work because there is no mnemonic in an x86 processor that will directly compare memory to memory, you must load at least one into a register. CMP BYTE PTR [ESI], AL
i suppose CMPSB loads a byte from [ESI] into a temporary register
I'm not sure what to think. jj says cmpsb doesn't use al, but Hutch and Dave allude to the possibility that it might?
no - the CMPS instructions do not use AL/AX/EAX :biggrin:
they perform the equivalent of
cmp [esi],[edi]
in byte, word, or dword form, as applicable
then, they adjust the index registers ESI and EDI
i mentioned a temporary register
that is a register (or plural) that the CPU uses for certain operations
the reason i say that is - i don't think there is any way for the hardware DMA to compare values at two different addresses - it's not part of the design
i.e., the CPU has to get one of the values into an internal register to make the comparison
One more test with extra long strings:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
10000 cmpsb
10000 [esi+ecx]
10000 [esi]
40124 cycles for cmpsb
30076 cycles for cmp [esi+ecx]
31215 cycles for cmp [esi]
40127 cycles for cmpsb
30055 cycles for cmp [esi+ecx]
31208 cycles for cmp [esi]
On the AMD, cmpsb was clearly faster, but on Intel it is exactly 33% slower.
must be "opposite day" :biggrin:
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
10000 cmpsb
10000 [esi+ecx]
10000 [esi]
45595 cycles for cmpsb
53307 cycles for cmp [esi+ecx]
60248 cycles for cmp [esi]
45436 cycles for cmpsb
53530 cycles for cmp [esi+ecx]
57437 cycles for cmp [esi]
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
10000 cmpsb
10000 [esi+ecx]
10000 [esi]
40343 cycles for cmpsb
20082 cycles for cmp [esi+ecx]
20089 cycles for cmp [esi]
40151 cycles for cmpsb
20159 cycles for cmp [esi+ecx]
20140 cycles for cmp [esi]
Hi,
Here are two more. Tried a third machine, but that hasn't
worked yet.
pre-P4 (SSE1)
10000 cmpsb
10000 [esi+ecx]
10000 [esi]
45371 cycles for cmpsb
32006 cycles for cmp [esi+ecx]
40431 cycles for cmp [esi]
45366 cycles for cmpsb
32014 cycles for cmp [esi+ecx]
40435 cycles for cmp [esi]
--- ok ---
pre-P410000 cmpsb
10000 [esi+ecx]
10000 [esi]
48511 cycles for cmpsb
58212 cycles for cmp [esi+ecx]
58046 cycles for cmp [esi]
48616 cycles for cmpsb
58051 cycles for cmp [esi+ecx]
58045 cycles for cmp [esi]
--- ok ---
Regards,
Steve N.
Ryan,
> I'm not sure what to think. jj says cmpsb doesn't use al, but Hutch and Dave allude to the possibility that it might?
Don't confuse CMP with CMPSB or similar, CMPSB is one of the antique string instructions where CMP is a fast integer operation. CMP can use any integer register where CMPSB is locked into specific registers. Its basically old DOS junk that is left in Intel processors for backwards compatibility.
There is special case circuitry in most later Intel processors to speed up some of the old string instructions MOVS CMPS LODS etc ... but it is specific to the use of the REP(?) prefix family with them and it locks you into specific register usage, the reason why I recommend that you code your comparisons using instructions like CMP instead.
Thanks Hutch.
I already took your suggestion in the other thread and used straight comparisons. I also want my search to be case insensitive, so it's easier for me to check for mixed case one by one.
My question in this thread was about the possibility of not having to use a register when comparing byte to byte. It appears the answer is it is not. I already know the solution; I'll just have to push/pop an additional register to make it happen.
Hutch has a function named szCmpi that is case insensitive :t
it's probably pretty fast, too
Quote from: hutch-- on June 09, 2012, 08:52:26 AMCMPSB is locked into specific registers. Its basically old DOS junk
Wait a second: Take the worst case above, 4 cycles per byte of comparison. With a 2GHz CPU, analysis of a 500MB file takes 500000000*4/2000000000 = 1 second.
It's good to be dogmatic when writing libraries, but for a real life app check first if the simplest instruction is really too "slow" :biggrin:
Pick a processor, pick a result, I used to own a 550 meg AMD years ago that was really fast on the old string instructions but very ordinary on the lower level integer instructions, since then you had at least 2 families of PIIIs, 3 families of PIVs, Core series duos and quads and the current i3/5/7s and this is only with Intel processors. Over a wide average the old string instructions live in microcode, not in the fast intrinsic instructions. Then you have an antique architecture locked into the source and destination index ALA 16 bit 8088 DOS code.
The only place for the old string instructions is in compatibility mode where special case circuitry make REP(?) prefixed version fast AFTER a specific byte count. The price of locking yourself into old DOS junk is buying its ancient architecture, something like trying to tune the last couple of horsepower out of a T model Ford.
just so you know...
not everyone feels that way - lol
in my opinion, string instructions have their place
it's one of the many tools in the toolbox
try to use the proper tool for the job
:biggrin:
> in my opinion, string instructions have their place
The verb tense is wrong "had" reflects their real value, 16 bit DOS on an 8088 or 8086 running at 5 mhz. DOS only died 20 years ago.
Quote from: hutch-- on June 10, 2012, 01:06:05 AM
The verb tense is wrong "had" reflects their real value, 16 bit DOS on an 8088 or 8086 running at 5 mhz. DOS only died 20 years ago.
So true :biggrin:
7C8023BC ³. 8D7D CC lea edi, [ebp-34]
7C8023BF ³. AB stosd
:shock:
i can't believe you guys don't agree with me :P
Quote from: dedndave on June 10, 2012, 07:07:18 AM
i can't believe you guys don't agree with me :P
Well, partly I do. I should remember that I have to check on Monday if the Win7 Kernel still has that 20 year old
DOSD stosd ;)
STOSD, STOSB, MOVSB, MOVSD and friends are NOT slow.
But the thread was about comparing and CMPSB is slow ;)
we timed it earlier in the thread, boggie
it's not all that bad
very machine-dependant results
Um... the thread was an inquiry about how cmpsb compares two byte pointers. I wanted to know if I could use similar logic to keep a register open for other use.
Ryan,
If you use CMPS you must use the registers it requires, if you perform the comparison using lower level mnemonics you have more freedom to choose the registers you have available. If you look up the Intel manual you will see that you must use the source index ESI and the destination index EDI to perform the comparison and used in conjunction with the REP prefix. Byte counts MUST be done in ECX.
From the manual,
Quote
The CF, OF, SF, ZF, AF, and PF flags are set according to the temporary result of the comparison.
I have tried to warn you that using CMPS is buying into old DOS junk and an antique architecture. In the 8088 - 8086 days you had no choice, with the advent of 32 bit on 486 and later you are free of these restrictions.
What you write has some to do with what you are after, if you are performing a normal scan for text, you normally load the 1st character of the text into a register then scan through the text data you are searching to find a single character match. When you find a character match you branch to another loop to compare the rest of the characters in the text being searched for. If it matches the search text you exit with the start offset of the result, else you return to the main loop and keep scanning the text being searched.
If you are performing a text comparison, you load both addresses and compare them byte by byte.