Word count challenge

Started by jj2007, February 14, 2016, 03:30:52 PM


jj2007

The task is to identify all unique words in bible.txt and to sort them by frequency. Here is my first version, just under 50 lines:

include \masm32\MasmBasic\MasmBasic.inc      ; download
  SetGlobals match, words, eof
  Init
  NanoTimer()
  Let esi=FileRead$("bible.txt")      ; get it from the Canterbury Corpus
  Let esi=Spc4$+Lower$(esi)+Spc4$
  Print Str$("%i\tms for reading the file and converting to Lower$\n", NanoTimer(ms))
  NanoTimer()
  Dim w$(25)            ; a-z
  m2m eof, LastFileSize
  add eof, esi
  .Repeat
      lodsb                  ; get next byte from file content
      .if al>="a" && al<="z"      ; new word
            lea edi, [esi-2]
            mov byte ptr [edi], " "      ; start marker is blank
            movzx ecx, al      ; ecx=1st char
            .Repeat
                  lodsb
            .Until al<"a" || al>"z"      ; find end of word
            push dword ptr [esi-1]      ; save current end of word
            mov dword ptr [esi-1], 0a0dh      ; end marker is CrLf & 2 nullbytes
            sub ecx, "a"      ; we need index 0...25
            .if Instr_(FAST, w$(ecx), edi, 0)      ; e.g. 1000#his#0999#ext#1234#
                  inc match
                  lea edi, [eax-8]      ; eax = match position; the 8-digit count sits right before the word
                  .if Val(edi)<=99999      ; read the current count
                        inc eax      ; Val returns the count in eax
                        movlps xmm0, qword ptr [Str$("%0000000i", eax)]      ; reformat as a zero-padded 8-digit string
                        movlps qword ptr [edi], xmm0      ; write the updated count back in place
                  .endif
            .else
                  inc words
                  Let w$(ecx)=w$(ecx)+"00000001"+edi
            .endif
            pop dword ptr [esi-1]      ; restore end of word
      .endif
  .Until esi>=eof
  Print Str$("%i\tms for counting\n", NanoTimer(ms))
  NanoTimer()
  Store "tmpwords.txt", w$()
  Recall "tmpwords.txt", w$()
  QSortDesc w$()
  Print Str$("%i\tms for sorting\n", NanoTimer(ms))
  For_ ecx=0 To 29
      PrintLine w$(ecx)
  Next
  Inkey Str$("%i duplicates", match-words), Str$(", %i unique words", words)
EndOfCode


Results for Intel i5 cpu on Win7-64:
28      ms for reading the file and converting to Lower$
361     ms for counting
5       ms for sorting
00061680 the
00049862 and
00033195 of
00013031 to
00012561 that
00012211 in
00009939 he
00009733 shall
00008821 for
00008808 unto
00008712 i
00008141 his
00007997 a
00007670 lord
00007141 they
00006931 be
00006879 is
00006423 not
00006388 him
00006257 them
00005949 it
00005860 with
00005430 all
00005370 thou
00004512 thy
00004388 god
00004319 was
00004305 my
00004263 which
00004034 me
742909 duplicates, 12473 unique words
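
A note on the storage format: the counter is kept as a fixed-width, zero-padded 8-digit decimal string directly in front of each word, so once the Store/Recall round trip has split the 26 per-letter buckets into one line per word, a plain descending string sort (QSortDesc) is already a sort by frequency. The same trick in a few lines of C++, purely as an illustration (not part of the challenge code; the sample values are taken from the output above):

#include <algorithm>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

int main()
{
    // zero-padded 8-digit counts in front of each word, as in the listing above
    std::vector<std::string> lines = { "00007670 lord", "00061680 the", "00009733 shall" };
    // a plain descending *string* sort already yields numeric order
    std::sort(lines.begin(), lines.end(), std::greater<std::string>());
    for (const std::string& s : lines)
        std::puts(s.c_str());
}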


Here is a snippet indicating where the bottleneck is - with the CRT, the analysis is a factor of 3 slower:
if 1      ; build-time switch: 1 = CRT strstr, 0 = MasmBasic Instr_
      push ecx
      mov ecx, w$(ecx)
      invoke crt_strstr, ecx, edi
      pop ecx
      .if eax
else
      .if Instr_(FAST, w$(ecx), edi, 0)      ; e.g. 1000#his#0999#ext#1234#
endif
:P

EDIT: Version 2 below multiplies the frequency by the word length, for use in a text compressor. Timings are the same, but the ranking favours long words:
00185040 the
00149586 and
00066390 of
00050244 that
00048665 shall
00035232 unto
00030680 lord
00028564 they
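
The weighting is simply count*length, e.g. from the first table:
00061680 * 3 = 00185040 for "the"
00009733 * 5 = 00048665 for "shall"
00013031 * 2 = 00026062 for "to", which is why "to" drops out of the top ranks while "shall" moves up.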

dedndave

counting lines of code using macros - really ?

jj2007

Quote from: dedndave on February 15, 2016, 11:16:03 PM
counting lines of code using macros - really ?

If you look at C/C++ code, do you count the lines of the disassembly?

dedndave

actually, i rarely count lines of code at all

but, you must be bored, my friend   :lol:

qWord

Quote from: jj2007 on February 14, 2016, 03:30:52 PM
Here is a snippet indicating where the bottleneck is - with the CRT, the analysis is a factor 3 slower
Evidently, the bottleneck is the wrong algorithm  :eusa_boohoo:
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

You have a faster one? Great, post it.

qWord

Here is that "crappy" C* stuff that easily beats your code:
Quote from: c++11
Time total: 120.592ms
Time to load file: 9.67094ms
Time to count words: 98.1632ms
Time to sort: 0.547959ms
Time to save to file: 12.2095ms
Quote from: jj
43      ms for reading the file and converting to Lower$
315     ms for counting
7       ms for sorting
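
The C++ source itself is not reproduced here; the gist of such a C++11 counter (one hash map lookup per word instead of a substring search, then a single sort by count - this is only a sketch of the idea, not the code behind the timings above) is:

#include <algorithm>
#include <cctype>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main()
{
    // read the whole file in one go
    std::ifstream in("bible.txt", std::ios::binary);
    std::string text((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());

    // count lower-cased runs of letters in a hash map: O(1) lookup per word
    std::unordered_map<std::string, unsigned> counts;
    std::string word;
    for (unsigned char c : text) {
        if (std::isalpha(c))
            word += static_cast<char>(std::tolower(c));
        else if (!word.empty()) { ++counts[word]; word.clear(); }
    }
    if (!word.empty()) ++counts[word];

    // one descending sort by frequency at the end
    std::vector<std::pair<std::string, unsigned>> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const std::pair<std::string, unsigned>& a,
                 const std::pair<std::string, unsigned>& b) { return a.second > b.second; });

    for (std::size_t i = 0; i < 30 && i < sorted.size(); ++i)
        std::printf("%08u %s\n", sorted[i].second, sorted[i].first.c_str());
}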
MREAL macros - when you need floating point arithmetic while assembling!

jj2007


TWell

How many words are there, actually?
jj2007:     742909 + 12473 = 755382 words
qWord:     755382 + 12473 = 767855 words
MS Word 2010: 771669 words
LibreOffice 5.0: 766112 words
wc.exe:    766111 words,  wc from here

jj2007

qWord's results and mine are actually identical: 767855 total words. The question is why MS Word finds so many more. My Word 2003, btw, finds 766112 words, which corresponds almost exactly to the wc.exe value.

The only way to find out would be to test small excerpts of the text. It is cumbersome, though. I've made a few tests, and sometimes my code finds even more words than MS Word.
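
One obvious source of differences: my code counts maximal runs of the letters a-z (so digits are skipped and something like "LORD's" splits in two), while wc -w counts whitespace-separated tokens. Just to illustrate the two conventions - this little C++ check is not part of the challenge code - they already disagree on a single verse line:

#include <cctype>
#include <cstdio>
#include <string>

// count maximal runs of letters (the convention used by the assembly code)
static int countLetterRuns(const std::string& s)
{
    int n = 0; bool in = false;
    for (unsigned char c : s) {
        bool letter = std::isalpha(c) != 0;
        if (letter && !in) ++n;
        in = letter;
    }
    return n;
}

// count whitespace-separated tokens (the convention used by wc -w)
static int countTokens(const std::string& s)
{
    int n = 0; bool in = false;
    for (unsigned char c : s) {
        bool space = std::isspace(c) != 0;
        if (!space && !in) ++n;
        in = !space;
    }
    return n;
}

int main()
{
    std::string line = "3:16 For God so loved the world, that he gave his only begotten Son";
    std::printf("letter runs: %d, whitespace tokens: %d\n",
                countLetterRuns(line), countTokens(line));
}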

hutch--

Guys, can we keep the claptrap out of the Lab? This is for code, not rhetoric.