News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Scanning a dictionary

Started by jj2007, February 03, 2023, 12:03:43 PM

Previous topic - Next topic

jj2007

I found a real life example of a dictionary here. Download, open in 7-zip, in several steps until you reach eng-deu.dict

Example:
Quoteparing cutters/ploughs /pˈeəɹɪŋ kˈʌtəz plˈaʊz/
Flachwender <pl>
   Synonyms: {skim ploughs}, {turf cutters/ploughs}

 see: {skim plough}, {turf cutter/plough}, {paring cutter/plough}

flattening /flˈatənɪŋ/
Flachwerden durch Niederdrücken [textil.]
uckwit /fˈʌkwɪt/
 [vulg.] Flachwichser <masc> [vulg.]
         Note: Dummkopf
         Note: fool

Looks like fun. Who writes the fastest algo to find e.g. uckw in this 80MB, two Million lines file?

They also have an API to access the dictionaries.

hutch--

It probably has a lot to do with how the dictionary is laid out. If the words are alphabetical, you could get to the first "f" offset by having its address in a table, then linear scan the words until you either find it or not. If I had to do a task like that I would set up a table with each alphabetical character as an offset to a list. If the lists are sorted as they should be, you would only need to scan the second character before you started to scan the whole word.

jj2007

Quote from: hutch-- on February 03, 2023, 01:53:12 PM
It probably has a lot to do with how the dictionary is laid out. If the words are alphabetical, you could get to the first "f" offset by having its address in a table, then linear scan the words until you either find it or not. If I had to do a task like that I would set up a table with each alphabetical character as an offset to a list. If the lists are sorted as they should be, you would only need to scan the second character before you started to scan the whole word.

Yep, exactly. I would go for a two-character table (26*26=676 DWORDs). If I had too much time at hand, I would try it, but I made a quick test with RichMasm, and finding 43 matches for flibbertigibbet in that 80MB document takes only about 100 milliseconds, so a proper solution is probably not worth the time :tongue: