The MASM Forum

General => The Campus => Topic started by: mabdelouahab on April 29, 2015, 03:00:23 AM

Title: Compar Strings
Post by: mabdelouahab on April 29, 2015, 03:00:23 AM
Hello Forum members

I have 850 keyword; and I will make method have one parameter is string must compare with all keyword and return True if it is equal one of keyword or false if not ; what is the fastest way to make this work   , (max size 10 char,char= A to Z , and uppercase = lowercase, Param=Unicode string)
I have 10000 string to compare, must I give fast  algorithm
Title: Re: Compar Strings
Post by: nidud on April 29, 2015, 03:45:45 AM
deleted
Title: Re: Compar Strings
Post by: jj2007 on April 29, 2015, 04:45:25 AM
StringsDiffer (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1157) is reasonably fast, and with option 1 it's case-insensitive. 10,000 comparisons take about one millisecond.
Title: Re: Compar Strings
Post by: rrr314159 on April 29, 2015, 05:42:05 AM
Hi mabdelouahab,

Do you have SIMD (SSE, AVX) instructions? If so, no doubt it would be fastest to use vector operations. How fast must it be? If 100 nanoseconds is good enough and you can use MasmBasic then go with jj2007's recommendation
Title: Re: Compar Strings
Post by: habran on April 29, 2015, 05:50:00 AM
I would suggest a hash table 8)
Title: Re: Compar Strings
Post by: qWord on April 29, 2015, 06:00:41 AM
Quote from: habran on April 29, 2015, 05:50:00 AM
I would suggest a hush table 8)
Me too.
Because the keywords are constant, one could try to find a collision-free hash table by varying the seed value and/or the hash algorithm (in such case the search is in O(1)).
Title: Re: Compar Strings
Post by: rrr314159 on April 29, 2015, 06:04:12 AM
how to use a "hush table": if you want someone to shut up, hit him over the head with a table

OTOH if you want to compare strings you might use a hash table :biggrin:

But seriously - I think it depends how many matches u expect. Hashing will be 100% correct when it rejects the match, but can give a false match (say it matches when it really doesn't). SO if you expect a match to be rare, it's probably good. But not if you expect many matches; because every one has to be checked for authenticity.

It also matters whether the "keyword" mentioned is static (which it probably is, I guess). If it never changes that's good, but if it changes then two hashes must be computed for every comparison.

Bottom line, you're probably right Habran, hashing would be a good approach

[edit] I posted this b4 seeing qWord's post ... qWord you're probably right the keyword is constant, just wasn't sure of that from mabdelouahab's description. Can you really guarantee no collisions? As I say, as long as few matches are expected occasional collision shouldn't hurt much. But if you can find a collision-free table then doesn't matter how many matches expected. And, I guess you can do that if the keyword is constant - just make sure nothing else hashes to the keyword! Hmm, now that u mention it sounds good, never thought of that b4
Title: Re: Compar Strings
Post by: habran on April 29, 2015, 06:13:05 AM
Quotehow to use a "hush table": if you want someone to shut up, hit him over the head with a table
:lol::eusa_clap:
Title: Re: Compar Strings
Post by: qWord on April 29, 2015, 06:37:39 AM
Quote from: rrr314159 on April 29, 2015, 06:04:12 AMCan you really guarantee no collisions?
no guarantee, but you might be lucky and find such a table.
Some time back I've played with that using MurmurHash: at least for several hundred strings I got results by varying just the seed value.

EDIT: I did wrote something (completely) wrong here before.  :icon_redface:
Title: Re: Compar Strings
Post by: mabdelouahab on April 29, 2015, 06:42:13 AM
Quote from: habran on April 29, 2015, 05:50:00 AM
I would suggest a hash table 8)
Good idea, I forgot, I've worked on it a lot in school days
Quote from: rrr314159 on April 29, 2015, 06:04:12 AM
how to use a "hush table": if you want someone to shut up, hit him over the head with a table
:lol:
Quote from: rrr314159 on April 29, 2015, 06:04:12 AM

[edit] I posted this b4 seeing qWord's post ... qWord you're probably right the keyword is constant, just wasn't sure of that from mabdelouahab's description. Can you really guarantee no collisions? As I say, as long as few matches are expected occasional collision shouldn't hurt much. But if you can find a collision-free table then doesn't matter how many matches expected. And, I guess you can do that if the keyword is constant - just make sure nothing else hashes to the keyword! Hmm, now that u mention it sounds good, never thought of that b4
yes, the keyword is constant, it is Masm Keyword
I have completed work on the generator codes for com for masm, but I figured the problem of function names, and Parameter Names, Class Names, sometimes identical to one of the words
for example method Invoke of Idispatsh, sometimes Function Named Instr or "Type","Name", ...
Title: Re: Compar Strings
Post by: rrr314159 on April 29, 2015, 10:06:32 AM
mabdelouahab,

seems to me hash table is not necessary and it's a lot more coding than simple comparison.

And, probably 100 nanos / comparison is fast enough for you? A simple comparison routine with SIMD registers could be made even faster, prob 30 nanos or less.

So I go back to my first recommendation:

Quote from: rrr314159Do you have SIMD (SSE, AVX) instructions? If so, no doubt it would be fastest to use vector operations. How fast must it be? If 100 nanoseconds is good enough and you can use MasmBasic then go with jj2007's recommendation

If you need to code the routine just follow nidud's example but do it with SIMD registers. Of course you can figure it out - you're the man who figured out COM!  :t
Title: Re: Compare Strings
Post by: hutch-- on April 29, 2015, 10:20:04 AM
If you want speed in order of performance, a finite state machine (static tree) is the fastest, a well designed hash table performs OK and somewhere down the track normal string compares come a distant last.

What I would recommend is a first character lookup to a label address that splits the word range alphabetically so that if your first character is "g", you only have to scan a list of words starting with "g" rather than the whole list. If done correctly it will hit into about the middle range of performance without having to write complex algorithms.
Title: Re: Compar Strings
Post by: rrr314159 on April 29, 2015, 10:42:11 AM
hutch, you're right, and I finally understand what he's looking for.

Quote from: mabdelouahabI have 850 keyword...

I thought he meant one keyword with 850 characters! Obviously not a word, but some sort of long string. No, he means 850 keywords - with an s on the end - each one probably only 10 or so characters long. When I read your reply it made no sense, went back and read OP. That's the very first time I got tripped up by poor English on this board!

So, forget what I said about SIMD registers, I was talking about a completely different problem
Title: Re: Compar Strings
Post by: hutch-- on April 29, 2015, 11:55:33 AM
RE: Hash tables, you must have collision detection and if a hash collision occurs, you step to the next available slot in the hash table. You trade hashing speed with collision detection speed and if you get it right, you can fill a hash table to about 80% before you get noticable degradation in performance. Best to try for about 50% fill to maximise speed.
Title: Re: Compar Strings
Post by: jj2007 on April 29, 2015, 05:29:54 PM
Any volunteers for a nice testbed in the 10...100 MB range? I have no experience in hush tables. Hush hush 8)

SOF article for bloody beginners like me: Here's an explanation in layman's terms. (http://stackoverflow.com/questions/730620/how-does-a-hash-table-work)
Title: Re: Compar Strings
Post by: mabdelouahab on April 29, 2015, 06:16:30 PM
Quote from: hutch-- on April 29, 2015, 10:20:04 AM
If you want speed in order of performance, a finite state machine (static tree) is the fastest, a well designed hash table performs OK and somewhere down the track normal string compares come a distant last.

What I would recommend is a first character lookup to a label address that splits the word range alphabetically so that if your first character is "g", you only have to scan a list of words starting with "g" rather than the whole list. If done correctly it will hit into about the middle range of performance without having to write complex algorithms.
hutch
I think that I've been working on this , And I would  added:
initially Must hold 850 check, When checking string length (Max length = 10, mean lower check 850/10)= Keep about 85 check, Then check the first character (character number 25 A-Z, reducing verification 85/25)=Keep about 3 to 4 check (Gain some 97%)
But should we arrange table first