Scanning a dictionary

jj2007 · February 03, 2023, 12:03:43 PM

I found a real life example of a dictionary here. Download, open in 7-zip, in several steps until you reach eng-deu.dict

Example:

Quoteparing cutters/ploughs /pËˆeÉ™É¹ÉªÅ‹ kËˆÊŒtÉ™z plËˆaÊŠz/
Flachwender <pl>
Synonyms: {skim ploughs}, {turf cutters/ploughs}

see: {skim plough}, {turf cutter/plough}, {paring cutter/plough}

flattening /flËˆatÉ™nÉªÅ‹/
Flachwerden durch NiederdrÃ¼cken [textil.]
uckwit /fËˆÊŒkwÉªt/
[vulg.] Flachwichser <masc> [vulg.]
Note: Dummkopf
Note: fool

Looks like fun. Who writes the fastest algo to find e.g. uckw in this 80MB, two Million lines file?

They also have an API to access the dictionaries.

hutch-- · February 03, 2023, 01:53:12 PM

It probably has a lot to do with how the dictionary is laid out. If the words are alphabetical, you could get to the first "f" offset by having its address in a table, then linear scan the words until you either find it or not. If I had to do a task like that I would set up a table with each alphabetical character as an offset to a list. If the lists are sorted as they should be, you would only need to scan the second character before you started to scan the whole word.

jj2007 · February 03, 2023, 08:54:00 PM

Quote from: hutch-- on February 03, 2023, 01:53:12 PM
It probably has a lot to do with how the dictionary is laid out. If the words are alphabetical, you could get to the first "f" offset by having its address in a table, then linear scan the words until you either find it or not. If I had to do a task like that I would set up a table with each alphabetical character as an offset to a list. If the lists are sorted as they should be, you would only need to scan the second character before you started to scan the whole word.

Yep, exactly. I would go for a two-character table (26*26=676 DWORDs). If I had too much time at hand, I would try it, but I made a quick test with RichMasm, and finding 43 matches for flibbertigibbet in that 80MB document takes only about 100 milliseconds, so a proper solution is probably not worth the time

The MASM Forum

News:

Scanning a dictionary

jj2007

hutch--

jj2007