The MASM Forum

General => The Workshop => Topic started by: frktons on December 18, 2012, 04:20:28 AM

Title: It's Time to build a Text File Compressor in Masm32
Post by: frktons on December 18, 2012, 04:20:28 AM
After a little discussion about Data Compression (http://masm32.com/board/index.php?topic=1091.0) I think it's time
to build some tools in order to see how doable
it is in Masm32  a project of this complexity.

I'll post here, in this first post, the work in progress while I try to build it.

It'll not be easy & it'll not be fast. But somehow I've to start, as they say:
"A ten thousand miles trip begins with the first step" (http://masm32.com/board/index.php?action=dlattach;topic=1091.0;attach=926)

As usual any suggestion and help is welcome.

Frank.
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: dedndave on December 18, 2012, 04:56:53 AM
Frank,
for what you appear to be doing, i still think a vocabulary list is far superior to compression
you are very likely to use many of the same words over and over for all the "items"
this would seem to indicate that it is better to create a list, rather than making an algo that finds repeated words

let's say you start with a vocabulary of 4096 words and phrases
punctuation, like periods, carriage returns, etc, would each be a seperate word
if you use 4096, that means that each word or phrase would be 12 bits in the "compressed" data stream

each of those 12-bit values is an index into a look-up table
the look-up table then yields an offset and a length into the dictionary
if you assume a dictionary of 4 kb max and a max phrase length of 17 bytes, each table entry will be 16 bits

so, you have a look-up table size of 8 kb max (you probably won't use all 4096 entries)
you have a dictionary of size 4 kb max (which might even be compressed)

once you have those in place, each of the individual text items "compresses" to a size of 12 bits per word/phrase
if we use 6 bytes per word/phrase (just a guess), then each item compresses to 25% of its' original size

the decompression code would be very small, fast, and easy to write
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: dedndave on December 18, 2012, 05:11:49 AM
let''s make some guesses....

let's estimate that your table and dictionary, combined, consume 8 kb (8192)
let's estimate that you have 100 items at 1200 bytes each (120000 total)
let's estimate that each item compresses to 300 bytes (30000 total)

(8192 + 30000) / 120000 ~ 31.8 % of the original size
that's without compressing the dictionary

while you may be able to beat that with some other form of compression,
the code to decompress it is going to be substantial, and probably slower
and - you won't beat the ratio by much   :P
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: dedndave on December 18, 2012, 05:55:55 AM
oh....
and, if you feel like you might need more that 4096 bytes for the dictionary,
use 8192 bytes for the dictionary and a max word length of 9 bytes
(length value of 0 means length = 1 byte)
if you need a longer word or phrase - so what, it uses 2 entries   :P
that would still get you the 12 bits per word

regardless of which method you use, words like "MOVSHDUP" and "MOVSLDUP"
can be made by combining shorter words like "MOVS", "H", "L", "DUP"

in the dictionary, you might have "MOVSHDUP" as one word
then, in the table, create indexes using the same dictionary string to make "MOV", "MOVS", "H", "DUP"
just an example - you get the idea
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: jj2007 on December 18, 2012, 06:48:15 AM
I guess as a coding exercise it could become a nice little project. But let's not get too serious - here (http://www.maximumcompression.com/data/summary_mf.php#data) you can find a ranking of major compressors. If we can beat WinZip's maximum non-portable setting, we have a chance to land on rank 180. You can only speculate how many man-years of professional programming are on the 179 ranks before ;-)

Since you are much more a fan of mathematics than I am, you might understand what Matt Mahoney (http://mattmahoney.net/dc/dce.html#Section_Conclusion) writes. Impressive ;-)
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on December 18, 2012, 07:07:03 AM
It's a coding exercise, of course. I don't think I'm going
to beat mathematicians and IT people that have dedicated
their entire life to these subjects.

Nevertheless If I get some good results, it would be nice
indeed.  :P

Dave, you are probably right, actually I don't know for sure.  ::)
Along the way we can test your ideas and mine and other
people's as well.  :t

Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: jj2007 on December 18, 2012, 07:38:48 AM
Building the dictionary is probably not too difficult, but some "architectural" decisions are needed.
For ex, setting bit 8 for "index is in bits 0-6" is fine but is a 127 entry dictionary efficient?
- What about going for 15 bits = 32767 entries by setting bit 8 for "index is in bits 0-6 and 6-15"? It costs two bytes instead of one but your table can hold 256 as many words.
- How to deal with frequently used single characters or two-byte combis, like "a" or "a " or ", "?
- Or tabs and spaces in assembler code? RLE triggered by what?
  ....
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: MichaelW on December 18, 2012, 08:07:03 AM
Instead of fixed-length encoding, why not variable-length encoding where words with a higher probability get shorter encodings?
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: dedndave on December 18, 2012, 08:19:02 AM
there are no special bits - other than, a value of 0 could be used to indicate the end or "unused 12 bit entry"
maybe my description wasn't stated very clearly

in the compressed stream, each "word" uses 12 bits (2 words for every 3 bytes)
once seperated, that 12-bit value is zero-extended and used as an index into the look-up table
each entry in the look-up table is a word, so the 12-bit value would be doubled

i think my later example is a little better, so let's use that
once you have the word from the LUT, it is split into 2 parts: the index and the length
we are using 13 bits for the index and 3 bits for the length
if the length bits are 000, the string has a length of 1
if the length bits are 111, the string has a length of 8
the other 13 bits are used as an index into a 8 kB dictionary
this tells you where the beginning of the word is
you go to that offset in the dictionary and grab <length> bytes and copy them into the output stream

in other words, the dictionary can be a "mingling" of words and phrases
a "word" or "phrase" does not have to be a word of text, it can be part of a word or 2 words
certain areas of the dictionary may be used to make more than 1 word
if the entire dictionary were...
db 'MOVSHDUP'
by creating different LUT entries, you can create "MOV", "MOVS", "DUP", as well as a few others
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: dedndave on December 18, 2012, 08:35:13 AM
to create the dictionary, you would combine the text of all the items
then, look for 8-byte strings that are used multiple times
each time you find a new 8-byte string you search the current dictionary to see if it can be made with existing data
if so, you add a new entry into the LUT
if not, you add the 8-byte string to the dictionary and create an LUT entry
each instance of the 8-byte string in the combined source is set to 0's

when you find no more 8-byte strings that are used multiple times, you do 7-byte strings, and so on
when you get down to 1-byte strings, you will likely find each character already in the dictionary
if not, you create a single-byte entry in the dictionary

now, you have the LUT and the dictionary

using that, you compress each item individually
they all use the same LUT and the same dictionary

it may take a little while to do the compression part
you don't care - only the decompression time is really important
and, you can see that the decompression code is going to be very small and simple
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: dedndave on December 18, 2012, 09:03:32 AM
we could use the text from Ray's FPU tutorial for the test
not necessarily all of it - just the parts where individual instructions are explained
each instruction is an "item"
that should be somewhat similar to what Frank intends to do   :P
i don't think he has written all his "items" yet
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on December 18, 2012, 09:57:31 AM
Quote from: dedndave on December 18, 2012, 09:03:32 AM
we could use the text from Ray's FPU tutorial for the test
not necessarily all of it - just the parts where individual instructions are explained
each instruction is an "item"
that should be somewhat similar to what Frank intends to do   :P
i don't think he has written all his "items" yet

Exactly Master. So far I'm not sure about the content of the text
to pack inside the program.

If I only had to display some Intel manual pages, explaining single instrucions,
maybe phrases could be somewhat similar.

But I used a page of the Intel Manual only because I had to consult it
to use a couple of SSE instructions during my latest tests.

Something that makes me doubt about the dictionary
is the concept of "phrase" you intend to put inside a dictionary.

Maybe the text content could be technical help, or explanation
of how to use MasmBasic [the latter being an idea in my mind]
or other stuff I actually don't know.

For the time being I just try some generic english text to see
what happens. When I'll post some code you could use your
ideas and experience to make it better, or to implement your
dictionary concept, and we'll see how it fits.
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on December 18, 2012, 11:30:55 AM
Before going to next step, the Menu Management of the program,
I need your test for Font and size of what you'll see running the
prog attached to the first post of this thread.

Let me know if it fits into your screen, and if you have any problem
displaying it.

Frank
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: dedndave on December 18, 2012, 11:58:17 AM
oops - i think you forgot the attachment   :redface:

i figured you were doing "items" for each of the MMX/SSE instructions
if you are going to expand beyond that, you might have to increase the dictionary and/or LUT sizes
if that's the case, you will not get as good of compression ratio
and - it might be just as well to use some other form of data compression

the example i have shown worked because of the similarities in text between individual instruction descriptions
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on December 18, 2012, 12:18:33 PM
Quote from: dedndave on December 18, 2012, 11:58:17 AM
oops - i think you forgot the attachment   :redface:

i figured you were doing "items" for each of the MMX/SSE instructions
if you are going to expand beyond that, you might have to increase the dictionary and/or LUT sizes
if that's the case, you will not get as good of compression ratio
and - it might be just as well to use some other form of data compression

the example i have shown worked because of the similarities in text between individual instruction descriptions

Dave. You don't really read my posts, do you?  :lol:

Try to read again my previous post and you'll find the attachments
of the trade.  :P

Quote from: dedndave on December 18, 2012, 09:03:32 AM
we could use the text from Ray's FPU tutorial for the test not necessarily all of it

I've chosen a more suitable text: a tutorial on data compression from the author of
one of the Apple II software: ShrinkIt as a text to experiment with.  :biggrin:

The idea of a dictionary is a good one. We only have to see and adapt the idea to
the real data. In a long text, in particular technical ones, the same words tend to be used
hundreds of times. So a dictionary with priority for most frequent/long words is in itself
quite nice.
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: dedndave on December 18, 2012, 12:47:26 PM
the console app looks nice
but, wouldn't it be easier to make it a GUI app ?
no messing with all those line chars and the console window bugs
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on December 18, 2012, 02:05:14 PM
Quote from: dedndave on December 18, 2012, 12:47:26 PM
the console app looks nice
but, wouldn't it be easier to make it a GUI app ?
no messing with all those line chars and the console window bugs


Actually I refuse to learn GUI stuff, and I prefer to code in old console style.
There are only a bounce of APIs for the console, against the hundreds for
the GUI. Maybe in the future. For the time being I have too many things to learn
and too short time.
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: Donkey on December 18, 2012, 02:53:50 PM
Holy cow, I've been reading through a few papers on dictionary algorithms specifically the LZ adaptive dictionary-based group of algorithms. Its going to take a while to wrap my head around this stuff, I have quite a bit of studying to do before I can add much to the discussion but I think a permutation of the LZFG algorithm is the way to go. I'm currently reading through the following

http://wisdombasedcomputing.com/vol1issue3december2011/paper34.pdf
http://everything2.com/title/suffix+tree
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on December 18, 2012, 03:18:49 PM
Quote from: Donkey on December 18, 2012, 02:53:50 PM
Holy cow, I've been reading through a few papers on dictionary algorithms specifically the LZ adaptive dictionary-based group of algorithms. Its going to take a while to wrap my head around this stuff, I have quite a bit of studying to do before I can add much to the discussion but I think a permutation of the LZFG algorithm is the way to go. I'm currently reading through the following

http://wisdombasedcomputing.com/vol1issue3december2011/paper34.pdf
http://everything2.com/title/suffix+tree

Probably the attachment to the first post of this thread could be interesting
for your studies then.
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on December 19, 2012, 02:19:54 PM
Added some code for Menu Management.

Actually the working keys are:

F1 / H for Help
ESC / E for Exit
Arrows to move from a menu item to another
PagUP/Down first/last menu item
Home/End as above
Numbers / PAD-Numbers to select corresponding menu item
C - Copies the text from the screen displayed into the clipboard
S - Saves the screen with its own format and colors
V - Partially implemented to view screen file for the time being

As usual the update is in the first post.

The procs for loading / scanning / comparing files are under construction.
(http://masm32.com/board/index.php?action=dlattach;topic=1091.0;attach=926)
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: jj2007 on December 19, 2012, 05:29:57 PM
Quote from: frktons on December 19, 2012, 02:19:54 PM
As usual the update is in the first post.

Hi Frank,
Could you use archives with folder names, e.g. \Masm32\Misc\Compressor? With lots of files, it's a nuisance to look every time for the right folder in WinZip etc...
Thanks,
JJ
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on December 19, 2012, 08:27:53 PM
Quote from: jj2007 on December 19, 2012, 05:29:57 PM

Hi Frank,
Could you use archives with folder names, e.g. \Masm32\Misc\Compressor? With lots of files, it's a nuisance to look every time for the right folder in WinZip etc...
Thanks,
JJ

Hi Jochen,

Yes we can.  :t
From next update.
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: TouEnMasm on December 20, 2012, 03:27:12 AM
This one is still in the http://www.masmforum.com/board/index.php?topic=15470.msg127506#msg127506 (http://www.masmforum.com/board/index.php?topic=15470.msg127506#msg127506) old forum.
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: dedndave on December 20, 2012, 03:29:30 AM
and, a few posts down from that one   :P
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on December 20, 2012, 03:46:55 AM
Having a cab.exe is the dream I was trying to avoid.
Maybe the sources could be more interesting.

Anyway, some updates for TFSC:

- added mouse to manage the menu
- allocated 2 buffers 1 mb each for file comparing or storing/compressing.
- added routine [to be completed] for file comparing.

Frank
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: FORTRANS on December 20, 2012, 08:47:06 AM
Quote from: Donkey on December 18, 2012, 02:53:50 PM
Holy cow, I've been reading through a few papers on dictionary algorithms specifically the LZ adaptive dictionary-based group of algorithms. Its going to take a while to wrap my head around this stuff, I have quite a bit of studying to do before I can add much to the discussion but I think a permutation of the LZFG algorithm is the way to go. I'm currently reading through the following

http://wisdombasedcomputing.com/vol1issue3december2011/paper34.pdf
http://everything2.com/title/suffix+tree

Hi,

   Had not heard of LZFG.  Its performance is surprisingly good
according to that paper.  Sounds complex though from their
comments.

Thanks,

Steve
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: nidud on January 07, 2013, 08:34:34 AM
deleted
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on January 07, 2013, 10:29:39 AM
Quote from: nidud on January 07, 2013, 08:34:34 AM
Not shore how you going to attack this, but I assume you need to create a token list of equal strings, so here is something to play with:

include io.inc
include stdio.inc

MINSTRING equ 3

.data

input label byte
incbin <srctxt>
isize equ $ - offset input

token db isize dup(?)
tokenz dd ?
maxlen dd ?

.code

longest_match:
xor eax,eax ; find the longest string
mov ebx,eax ; return:
mov edx,eax ;  EBX: length of string
lea edi,token ;  EDX: offset in token buffer
mov ecx,tokenz
cmp ecx,1
ja scan
ret
      scan:
mov     al,[esi]
repnz scasb
je @F
ret
      @@:
      push edi
push esi
push ecx
inc esi
repe cmpsb
je @F
dec edi
dec esi
      @@:
pop ecx
mov     eax,esi
pop esi
pop edi
sub eax,esi
cmp eax,ebx
jb @F
mov ebx,eax
mov edx,edi
dec edx
      @@:
      cmp ecx,1
ja scan
ret

tokenize:
mov esi,offset input
mov edi,offset token
mov ecx,MINSTRING
mov tokenz,ecx
rep movsb
    tokenize_loop:
      call longest_match
cmp ebx,MINSTRING
jae tokenize_match
mov eax,tokenz
mov edi,offset token
mov ecx,edi
cmp esi,ecx
ja tokenize_end
add edi,eax
sub ecx,esi
cmp ecx,MINSTRING
jb tokenize_break
mov ecx,MINSTRING
add tokenz,ecx
add eax,ecx
cmp eax,isize
jae tokenize_rest
rep movsb
jmp tokenize_loop
    tokenize_match:
      cmp ebx,maxlen
jb @F
mov maxlen,ebx
      @@:
mov ecx,offset token
cmp esi,ecx
ja tokenize_end
sub ecx,esi
cmp ecx,ebx
jb tokenize_break
add esi,ebx
jmp tokenize_loop
    tokenize_break:
mov edi,tokenz
add edi,offset token
    tokenize_rest:
    rep movsb
    tokenize_end:
ret

main proc c uses esi edi ebx
; mov [eax],eax
call tokenize
invoke printf,"\ninput:\t%7d\noutput:\t%7d\nmaxlen:\t%7d\n",isize,tokenz,maxlen
.if osopen("token",_A_NORMAL, M_WRONLY, A_CREATETRUNC) != -1
    mov esi,eax
    invoke oswrite,esi,addr token,tokenz
    invoke close,esi
.endif
sub eax,eax
ret
main endp

end


The dictionary using minimum 3 byte length is 15258 for the Data_Compression.txt file


input: 253568
output:   15258
maxlen:      20


The overall project is targeted at implementing the usual algos:
Huffman, LZ, LZW, Arithmetic... and see, afterwards, how to
create something new, and hopefully faster or with a superior
compression ratio.
Faster could be possible thanks to some Assembly tricks, superior
as compression ratio will depend on many things, actually not tested.

Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: nidud on January 07, 2013, 10:45:08 PM
deleted
Title: Re: It's Time to build a Text File Compressor in Masm32
Post by: frktons on January 08, 2013, 12:47:50 AM
Quote from: nidud on January 07, 2013, 10:45:08 PM
This is one way of reading bits from the input stream:


.data

bb dd ? ; bit buffer
bk db ? ; number of bits in bb
ios_i dd ? ; index in input stream

.code

getbits:
cmp bk,al
jb @F
mov cl,al
mov eax,1 ; create mask (a mask table is maybe better..)
shl eax,cl
dec eax
and eax,bb ; bits to EAX
sub bk,cl ; dec bit count
shr bb,cl ; dump used bits
inc cl ; set ZF flag
ret
      @@:
push eax ; add a byte to bb
mov eax,ios_i
cmp eax,isize
je @F ; eof..
inc ios_i
add eax,offset input
movzx eax,byte ptr [eax]
mov cl,bk
shl eax,cl
or bb,eax
add bk,8
pop eax
jmp getbits
      @@:
pop eax
    ret



Thanks nidud, these suggestions will soon be useful for
the compression task.  :t