How difficult is it to build a shrinking/deflating routine?

Started by frktons, December 15, 2012, 10:58:49 AM

frktons

Something that puzzles me is how to compress/decompress data
inside a program: for example, to keep 30 big strings [1000-2000 bytes each]
compressed inside the program and decompress them, one or more at a time,
only when they are needed.

I tried (a couple of years ago) to use code instead of data, and it worked because
of the nature of the data to be compressed:
http://www.masmforum.com/board/index.php?topic=14734.0

But a general compressing/decompressing routine for text strings is out of my league.

I don't even know where to start.

I'd like to know your opinions and any experience you may have on the subject.

Frank 
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

dedndave

you can read up on LZW compression
that is what GIF files use (ZIP files normally use DEFLATE, which is LZ77 plus Huffman coding)
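
A minimal Python sketch of the LZW idea, for anyone who finds the formula-heavy write-ups hard going. It is an illustration only, not code from this thread: the function names `lzw_compress` and `lzw_decompress` are made up, and the codes are left as a plain list of integers rather than being packed into bits, which a real implementation would still have to do.

```python
def lzw_compress(data):
    """Turn bytes into a list of integer codes (hypothetical helper)."""
    # Start with one table entry per possible byte value.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            # Keep growing the match while the table already knows it.
            current = candidate
        else:
            # Emit the longest known match and learn the new sequence.
            codes.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the original bytes from the list of codes."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Special case: the code refers to the entry being built right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out += entry
        # The decoder rebuilds the same table the encoder built.
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return bytes(out)
```

The point to notice is that no table is stored with the data: the decoder rebuilds it on the fly, so only the codes need to live in the program.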

there are other types of compression, as well
simple packing can reduce some files
they remove repetitious bytes with table entries and expand them from the table to decompress

another method is mathematical
they find an area that contains fewer than "some number" of used values and compress mathematically
for example, if you have an area of a file that has no more than 16 unique byte values,
that area may be compressed by a factor of 2
the same thing applies for other powers of 2 - that is just a simple one to envision
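
The 16-value case can be sketched in Python roughly as follows. This is an assumed illustration, not anyone's posted code: `pack_nibbles` and `unpack_nibbles` are hypothetical names, and the layout (a small table of the used byte values plus two 4-bit table indexes per output byte) is just one way to do it.

```python
def pack_nibbles(data):
    """Return (table, packed) for data with at most 16 distinct byte values."""
    table = bytes(sorted(set(data)))
    if len(table) > 16:
        raise ValueError("more than 16 unique byte values")
    index = {value: i for i, value in enumerate(table)}
    packed = bytearray()
    for pos in range(0, len(data), 2):
        high = index[data[pos]]
        # Pad the last nibble with index 0 when the length is odd.
        low = index[data[pos + 1]] if pos + 1 < len(data) else 0
        packed.append((high << 4) | low)
    return table, bytes(packed)


def unpack_nibbles(table, packed, length):
    """Expand packed nibbles back to bytes; length is the original size."""
    out = bytearray()
    for byte in packed:
        out.append(table[byte >> 4])
        out.append(table[byte & 0x0F])
    # Drop the padding nibble if the original length was odd.
    return bytes(out[:length])
```

The table and the original length have to be stored too, so the real saving is a little under a factor of 2 on short strings.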

frktons

I'd like to compress simple ASCII text strings like:

.data

string1 db "blabalblabalblabalblabalblabalbalbahnbagbgbgabgbagbagbag",
               "blabalblabalblabalblabalblabalbalbahnbagbgbgabgbagbagbag",
               "blabalblabalblabalblabalblabalbalbahnbagbgbgabgbagbagbag" ....

string2 db "blabalblabalblabalblabalblabalbalbahnbagbgbgabgbagbagbag",
               "blabalblabalblabalblabalblabalbalbahnbagbgbgabgbagbagbag",
               "blabalblabalblabalblabalblabalbalbahnbagbgbgabgbagbagbag" ....
......

These make the program 30-100k bigger, and I'd like to shrink them into something much
smaller.

According to your experience and/or knowledge, what kind of shrinking proc
would be a good fit?

The documentation online is quite unreadable [too many math formulas] for me.
I need something that is more descriptive, and simpler at the same time,
I really don't need complex mathematical algos.

Gunner

It could be as simple as doing something like this:
0605blabalbalbahnbagbgbgabg0303bag
for this string:
blabalblabalblabalblabalblabalbalbahnbagbgbgabgbagbagbag

Then you could expand the string easily:
0605blabalbalbahnbagbgbgabg0303bag

0605= the next 6 bytes are repeated 5 times, 0303=the next 3 bytes are repeated 3 times.
~Rob
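
A hedged Python sketch of an expander for this kind of scheme. One assumption is added that the post does not spell out: every block, including the non-repeated middle part, gets a 4-digit header (2 digits of length, 2 digits of repeat count), so the expander can always tell headers from text. With that assumption the sample string would be stored as 0605blabal1701balbahnbagbgbgabg0303bag. The name `expand` is made up for the example.

```python
def expand(packed):
    """Expand a string made of LLCC-prefixed blocks.

    LL = length of the block that follows (2 decimal digits),
    CC = how many times to repeat it (2 decimal digits, 01 = literal).
    """
    out = []
    pos = 0
    while pos < len(packed):
        length = int(packed[pos:pos + 2])
        count = int(packed[pos + 2:pos + 4])
        block = packed[pos + 4:pos + 4 + length]
        if len(block) != length:
            raise ValueError("truncated block")
        out.append(block * count)
        pos += 4 + length
    return "".join(out)
```

Stored this way the sample takes 38 characters instead of 56. The hard part is the other direction, finding good repeats when packing, but that only has to run once at build time.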

frktons

Quote from: Gunner on December 15, 2012, 11:56:03 AM
It could be as simple as doing something like this:
0605blabalbalbahnbagbgbgabg0303bag
for this string:
blabalblabalblabalblabalblabalbalbahnbagbgbgabgbagbagbag

Then you could expand the string easily:
0605blabalbalbahnbagbgbgabg0303bag

0605= the next 6 bytes are repeated 5 times, 0303=the next 3 bytes are repeated 3 times.

Yes Gunner, this is one of the things I've meditated upon. Of course you need to indicate what
has to be duplicated in order for it to actually do something ::)

We could code it like:
Quote
blabal blabal blabal blabal blabal balbahnbagbgbgabgbagbagbag =
0105H ---> indicate that the string number 1 in a strings array ["blabal"]
has to be duplicated five times.

Probably you intended something like that, I suppose.

My apologies, you wrote it in a line of code I didn't see.  :redface:


dedndave

let me seeeeee.....

i did something simple, long ago, to compress a text file inside an exe
the math case i used would allow up to 85 unique characters, which were in a translation table
each packed string had 5 bytes and unpacked into 6 characters
it used base 85 math   :P
not great compression, but it saved a few kb on a 21 kb text file and it unpacked quickly

char 1 x 1
+char 2 x 85
+char 3 x 85^2
+char 4 x 85^3
+char 5 x 85^4
+char 6 x 85^5

that adds up to a binary value that will fit into 5 bytes
but - the char was not added directly
the offset of the char in the translation table was always 0 to 84 and that was used

i used that rather than LZW because the code was very simple
in that particular case - if the code was big, nothing was gained - lol
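
The base 85 packing described above can be sketched in Python like this. It is only an illustration under assumptions: the 85-character `TABLE` below is invented for the example (the original translation table was not posted), and `pack6` / `unpack6` are hypothetical names. It works because 85^6 = 377,149,515,625 is less than 2^40 = 1,099,511,627,776, so six table offsets always fit in 5 bytes.

```python
import string

# Hypothetical 85-character translation table: space, newline, letters,
# digits and the first 21 ASCII punctuation marks (2 + 52 + 10 + 21 = 85).
TABLE = " \n" + string.ascii_letters + string.digits + string.punctuation[:21]
INDEX = {ch: i for i, ch in enumerate(TABLE)}


def pack6(chars):
    """Pack exactly 6 table characters into 5 bytes."""
    if len(chars) != 6:
        raise ValueError("need exactly 6 characters")
    value = 0
    for position, ch in enumerate(chars):
        # char 1 x 1 + char 2 x 85 + ... + char 6 x 85^5, using table offsets
        value += INDEX[ch] * 85 ** position
    return value.to_bytes(5, "little")


def unpack6(packed):
    """Unpack 5 bytes back into 6 characters."""
    value = int.from_bytes(packed, "little")
    chars = []
    for _ in range(6):
        # Repeated division by 85 peels the offsets off in order.
        value, offset = divmod(value, 85)
        chars.append(TABLE[offset])
    return "".join(chars)
```

The ratio is fixed at 5/6 (about 83%) whatever the text is, as long as it uses no more than 85 different characters and is padded to a multiple of 6.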

frktons

Probably a real example could help.
If I have the following text in a string, and I want to save
some bytes [half or more of its length], how could I do it,
without resorting to complex algos?
Quote
MOVHLPS— Move Packed Single-Precision Floating-Point Values High to Low

MOVHLPS xmm1, xmm2

SSE

Description
-----------

This instruction cannot be used for memory to register moves.

128-bit two-argument form:
Moves two packed single-precision floating-point values from the high quadword of the second XMM argument
(second operand) to the low quadword of the first XMM register (first argument). The high quadword of the destination
operand is left unchanged. Bits (VLMAX-1:64) of the corresponding YMM destination register are unmodified.

128-bit three-argument form
Moves two packed single-precision floating-point values from the high quadword of the third XMM argument (third
operand) to the low quadword of the destination (first operand). Copies the high quadword from the second XMM
argument (second operand) to the high quadword of the destination (first operand). Bits (VLMAX-1:128) of the
destination YMM register are zeroed.

In 64-bit mode, use of the REX.R prefix permits this instruction to access additional registers (XMM8-XMM15).
If VMOVHLPS is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will cause
an #UD exception.

The text file for this string is about 1,200 bytes. What about reducing it to half?

dedndave

half is very ambitious
you might be doing well at 75 to 85 %

frktons

Quote from: dedndave on December 15, 2012, 12:43:47 PM
half is very ambitious
you might be doing well at 75 to 85 %

Did I ever miss a goal?  :lol:
We'll start with 50% shrinking, and after that we'll see.
For the time being I'm preparing the job.
Here is the result we have to reach:
http://masm32.com/board/index.php?action=dlattach;topic=1091.0;attach=920

dedndave

you might get 50%
it depends on the data
let me look at it....

dedndave

you wanna give me that as a text file ?   :biggrin:
otherwise, i hafta write a little program to convert it to plain text

dedndave

i got it...
:lol:

dedndave

i'd be surprised if you can get that down to 50%

actually - a ZIP file of it is 53%
it looks like an LZ-type compressor (ZIP's DEFLATE, or LZW) is a good choice, as that goes
however, the decompression code is likely to be a few hundred bytes   :P
that's if you use your own code
now - there is probably a way to use the OS unzipper to decompress it for you
never tried that - but i am sure someone in here has

frktons

Quote from: dedndave on December 15, 2012, 01:14:03 PM
i'd be surprised if you can get that down to 50%

It'll take some time, but we'll do it, as usual  :lol:

dedndave

so - zip it
put it in the EXE as a raw binary resource
use a routine like the one i posted last week to get data from a resource
figure out how to use the OS's unzipper
and you are there
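
To get a feel for what a ready-made decompressor buys you before wiring anything into an EXE, here is a small Python sketch using the standard `zlib` module. It is a stand-in, not the OS unzipper mentioned above: zlib uses DEFLATE, the same algorithm ZIP normally uses, but its output is a raw zlib stream and not a .zip archive. The names `pack` and `unpack` are made up for the example.

```python
import zlib


def pack(text):
    """Compress text with DEFLATE at the highest level (zlib container)."""
    return zlib.compress(text.encode("utf-8"), 9)


def unpack(blob):
    """Decompress a blob produced by pack() back into text."""
    return zlib.decompress(blob).decode("utf-8")


if __name__ == "__main__":
    # Sample sentence taken from the MOVHLPS description quoted in the thread.
    sample = (
        "Moves two packed single-precision floating-point values from the "
        "high quadword of the second XMM argument (second operand) to the "
        "low quadword of the first XMM register (first argument).\n"
    )
    blob = pack(sample)
    assert unpack(blob) == sample
    print(len(sample.encode("utf-8")), "bytes ->", len(blob), "bytes")
```

Running it on the real strings shows quickly whether a target like 50% is realistic for that data before any assembly gets written.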