News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

It's Time to build a Text File Compressor in Masm32

Started by frktons, December 18, 2012, 04:20:28 AM

Previous topic - Next topic

frktons

After a little discussion about Data Compression I think it's time
to build some tools in order to see how doable
it is in Masm32  a project of this complexity.

I'll post here, in this first post, the work in progress while I try to build it.

It'll not be easy & it'll not be fast. But somehow I've to start, as they say:
"A ten thousand miles trip begins with the first step"

As usual any suggestion and help is welcome.

Frank.
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

dedndave

Frank,
for what you appear to be doing, i still think a vocabulary list is far superior to compression
you are very likely to use many of the same words over and over for all the "items"
this would seem to indicate that it is better to create a list, rather than making an algo that finds repeated words

let's say you start with a vocabulary of 4096 words and phrases
punctuation, like periods, carriage returns, etc, would each be a seperate word
if you use 4096, that means that each word or phrase would be 12 bits in the "compressed" data stream

each of those 12-bit values is an index into a look-up table
the look-up table then yields an offset and a length into the dictionary
if you assume a dictionary of 4 kb max and a max phrase length of 17 bytes, each table entry will be 16 bits

so, you have a look-up table size of 8 kb max (you probably won't use all 4096 entries)
you have a dictionary of size 4 kb max (which might even be compressed)

once you have those in place, each of the individual text items "compresses" to a size of 12 bits per word/phrase
if we use 6 bytes per word/phrase (just a guess), then each item compresses to 25% of its' original size

the decompression code would be very small, fast, and easy to write

dedndave

let''s make some guesses....

let's estimate that your table and dictionary, combined, consume 8 kb (8192)
let's estimate that you have 100 items at 1200 bytes each (120000 total)
let's estimate that each item compresses to 300 bytes (30000 total)

(8192 + 30000) / 120000 ~ 31.8 % of the original size
that's without compressing the dictionary

while you may be able to beat that with some other form of compression,
the code to decompress it is going to be substantial, and probably slower
and - you won't beat the ratio by much   :P

dedndave

oh....
and, if you feel like you might need more that 4096 bytes for the dictionary,
use 8192 bytes for the dictionary and a max word length of 9 bytes
(length value of 0 means length = 1 byte)
if you need a longer word or phrase - so what, it uses 2 entries   :P
that would still get you the 12 bits per word

regardless of which method you use, words like "MOVSHDUP" and "MOVSLDUP"
can be made by combining shorter words like "MOVS", "H", "L", "DUP"

in the dictionary, you might have "MOVSHDUP" as one word
then, in the table, create indexes using the same dictionary string to make "MOV", "MOVS", "H", "DUP"
just an example - you get the idea

jj2007

I guess as a coding exercise it could become a nice little project. But let's not get too serious - here you can find a ranking of major compressors. If we can beat WinZip's maximum non-portable setting, we have a chance to land on rank 180. You can only speculate how many man-years of professional programming are on the 179 ranks before ;-)

Since you are much more a fan of mathematics than I am, you might understand what Matt Mahoney writes. Impressive ;-)

frktons

It's a coding exercise, of course. I don't think I'm going
to beat mathematicians and IT people that have dedicated
their entire life to these subjects.

Nevertheless If I get some good results, it would be nice
indeed.  :P

Dave, you are probably right, actually I don't know for sure.  ::)
Along the way we can test your ideas and mine and other
people's as well.  :t

There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

jj2007

Building the dictionary is probably not too difficult, but some "architectural" decisions are needed.
For ex, setting bit 8 for "index is in bits 0-6" is fine but is a 127 entry dictionary efficient?
- What about going for 15 bits = 32767 entries by setting bit 8 for "index is in bits 0-6 and 6-15"? It costs two bytes instead of one but your table can hold 256 as many words.
- How to deal with frequently used single characters or two-byte combis, like "a" or "a " or ", "?
- Or tabs and spaces in assembler code? RLE triggered by what?
  ....

MichaelW

Instead of fixed-length encoding, why not variable-length encoding where words with a higher probability get shorter encodings?
Well Microsoft, here's another nice mess you've gotten us into.

dedndave

there are no special bits - other than, a value of 0 could be used to indicate the end or "unused 12 bit entry"
maybe my description wasn't stated very clearly

in the compressed stream, each "word" uses 12 bits (2 words for every 3 bytes)
once seperated, that 12-bit value is zero-extended and used as an index into the look-up table
each entry in the look-up table is a word, so the 12-bit value would be doubled

i think my later example is a little better, so let's use that
once you have the word from the LUT, it is split into 2 parts: the index and the length
we are using 13 bits for the index and 3 bits for the length
if the length bits are 000, the string has a length of 1
if the length bits are 111, the string has a length of 8
the other 13 bits are used as an index into a 8 kB dictionary
this tells you where the beginning of the word is
you go to that offset in the dictionary and grab <length> bytes and copy them into the output stream

in other words, the dictionary can be a "mingling" of words and phrases
a "word" or "phrase" does not have to be a word of text, it can be part of a word or 2 words
certain areas of the dictionary may be used to make more than 1 word
if the entire dictionary were...
db 'MOVSHDUP'
by creating different LUT entries, you can create "MOV", "MOVS", "DUP", as well as a few others

dedndave

to create the dictionary, you would combine the text of all the items
then, look for 8-byte strings that are used multiple times
each time you find a new 8-byte string you search the current dictionary to see if it can be made with existing data
if so, you add a new entry into the LUT
if not, you add the 8-byte string to the dictionary and create an LUT entry
each instance of the 8-byte string in the combined source is set to 0's

when you find no more 8-byte strings that are used multiple times, you do 7-byte strings, and so on
when you get down to 1-byte strings, you will likely find each character already in the dictionary
if not, you create a single-byte entry in the dictionary

now, you have the LUT and the dictionary

using that, you compress each item individually
they all use the same LUT and the same dictionary

it may take a little while to do the compression part
you don't care - only the decompression time is really important
and, you can see that the decompression code is going to be very small and simple

dedndave

we could use the text from Ray's FPU tutorial for the test
not necessarily all of it - just the parts where individual instructions are explained
each instruction is an "item"
that should be somewhat similar to what Frank intends to do   :P
i don't think he has written all his "items" yet

frktons

Quote from: dedndave on December 18, 2012, 09:03:32 AM
we could use the text from Ray's FPU tutorial for the test
not necessarily all of it - just the parts where individual instructions are explained
each instruction is an "item"
that should be somewhat similar to what Frank intends to do   :P
i don't think he has written all his "items" yet

Exactly Master. So far I'm not sure about the content of the text
to pack inside the program.

If I only had to display some Intel manual pages, explaining single instrucions,
maybe phrases could be somewhat similar.

But I used a page of the Intel Manual only because I had to consult it
to use a couple of SSE instructions during my latest tests.

Something that makes me doubt about the dictionary
is the concept of "phrase" you intend to put inside a dictionary.

Maybe the text content could be technical help, or explanation
of how to use MasmBasic [the latter being an idea in my mind]
or other stuff I actually don't know.

For the time being I just try some generic english text to see
what happens. When I'll post some code you could use your
ideas and experience to make it better, or to implement your
dictionary concept, and we'll see how it fits.
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

frktons

Before going to next step, the Menu Management of the program,
I need your test for Font and size of what you'll see running the
prog attached to the first post of this thread.

Let me know if it fits into your screen, and if you have any problem
displaying it.

Frank
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

dedndave

oops - i think you forgot the attachment   :redface:

i figured you were doing "items" for each of the MMX/SSE instructions
if you are going to expand beyond that, you might have to increase the dictionary and/or LUT sizes
if that's the case, you will not get as good of compression ratio
and - it might be just as well to use some other form of data compression

the example i have shown worked because of the similarities in text between individual instruction descriptions

frktons

Quote from: dedndave on December 18, 2012, 11:58:17 AM
oops - i think you forgot the attachment   :redface:

i figured you were doing "items" for each of the MMX/SSE instructions
if you are going to expand beyond that, you might have to increase the dictionary and/or LUT sizes
if that's the case, you will not get as good of compression ratio
and - it might be just as well to use some other form of data compression

the example i have shown worked because of the similarities in text between individual instruction descriptions

Dave. You don't really read my posts, do you?  :lol:

Try to read again my previous post and you'll find the attachments
of the trade.  :P

Quote from: dedndave on December 18, 2012, 09:03:32 AM
we could use the text from Ray's FPU tutorial for the test not necessarily all of it

I've chosen a more suitable text: a tutorial on data compression from the author of
one of the Apple II software: ShrinkIt as a text to experiment with.  :biggrin:

The idea of a dictionary is a good one. We only have to see and adapt the idea to
the real data. In a long text, in particular technical ones, the same words tend to be used
hundreds of times. So a dictionary with priority for most frequent/long words is in itself
quite nice.
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama