Author Topic: Scanning a dictionary  (Read 370 times)

jj2007

  • Member
  • *****
  • Posts: 13944
  • Assembly is fun ;-)
    • MasmBasic
Scanning a dictionary
« on: February 03, 2023, 12:03:43 PM »
I found a real life example of a dictionary here. Download, open in 7-zip, in several steps until you reach eng-deu.dict

Example:
Quote
paring cutters/ploughs /pˈeəɹɪŋ kˈʌtəz plˈaʊz/
Flachwender <pl>
   Synonyms: {skim ploughs}, {turf cutters/ploughs}

 see: {skim plough}, {turf cutter/plough}, {paring cutter/plough}

flattening /flˈatənɪŋ/
Flachwerden durch Niederdrücken [textil.]
fuckwit /fˈʌkwɪt/
 [vulg.] Flachwichser <masc> [vulg.]
         Note: Dummkopf
         Note: fool

Looks like fun. Who writes the fastest algo to find e.g. fuckw in this 80MB, two Million lines file?

They also have an API to access the dictionaries.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 10583
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Scanning a dictionary
« Reply #1 on: February 03, 2023, 01:53:12 PM »
It probably has a lot to do with how the dictionary is laid out. If the words are alphabetical, you could get to the first "f" offset by having its address in a table, then linear scan the words until you either find it or not. If I had to do a task like that I would set up a table with each alphabetical character as an offset to a list. If the lists are sorted as they should be, you would only need to scan the second character before you started to scan the whole word.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

jj2007

  • Member
  • *****
  • Posts: 13944
  • Assembly is fun ;-)
    • MasmBasic
Re: Scanning a dictionary
« Reply #2 on: February 03, 2023, 08:54:00 PM »
It probably has a lot to do with how the dictionary is laid out. If the words are alphabetical, you could get to the first "f" offset by having its address in a table, then linear scan the words until you either find it or not. If I had to do a task like that I would set up a table with each alphabetical character as an offset to a list. If the lists are sorted as they should be, you would only need to scan the second character before you started to scan the whole word.

Yep, exactly. I would go for a two-character table (26*26=676 DWORDs). If I had too much time at hand, I would try it, but I made a quick test with RichMasm, and finding 43 matches for flibbertigibbet in that 80MB document takes only about 100 milliseconds, so a proper solution is probably not worth the time :tongue: