How difficult is it to build a shrinking/deflating routine?

Started by frktons, December 15, 2012, 10:58:49 AM


dedndave

something i tried long ago was a dictionary of words
each word then had an index number
you create the text by using a series of index numbers
i was surprised at how fast it was   :P
you can use the text for some words more than once
for example, the string "aren't" can be used for "a", "are", "aren't"
you can also use the "n't" to make other contractions like "doesn't", "don't", "isn't", etc
of course, if the index is longer than the string, the advantage dwindles a bit
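dedndave's scheme can be sketched in a few lines of Python. The `POOL`/`DICT` contents and the literal fallback are illustrative assumptions; the point is that one stored string ("doesn't" plus "aren't") serves several dictionary words via (offset, length) slices:

```python
# One shared text pool; each dictionary word is an (offset, length) slice,
# so the stored text is reused for several words (dedndave's trick).
POOL = "doesn'taren'tis"
DICT = {
    "does":    (0, 4),
    "doesn't": (0, 7),
    "n't":     (4, 3),
    "a":       (7, 1),
    "are":     (7, 3),
    "aren't":  (7, 6),
    "is":      (13, 2),
}
INDEX = list(DICT)                          # index number -> word

def compress(text):
    out = []
    for word in text.split():
        if word in DICT:
            out.append(INDEX.index(word))   # small index number
        else:
            out.append(word)                # literal fallback for unknown words
    return out

def decompress(tokens):
    words = []
    for t in tokens:
        if isinstance(t, int):
            off, n = DICT[INDEX[t]]
            words.append(POOL[off:off + n]) # slice the shared pool
        else:
            words.append(t)
    return " ".join(words)
```

As dedndave notes, the gain depends on the index being shorter than the word it replaces; a real implementation would use fixed-width byte indices rather than Python objects.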

jj2007

I hacked together an example with my favourite test string, i.e. the contents of \Masm32\include\Windows.inc, using RtlCompressBuffer

The good news: It works.
The bad news: its performance is hilarious, both in terms of compression time and compression ratio (you're reading this in Redmond, aren't you?  :greensml:).

RtlGetCompressionWorkSpaceSize OK
edx             4096
$Err$()         Operation completed

RtlCompressBuffer OK    $Err$()         Operation completed

Bytes written vs original size
CompSize        285782   ; << for comparison: WinZip 193990, FreeArc 146200 bytes
LastFileSize    977412

FinalUncompressedSize
ecx             977412
eax             0
$Err$()         Operation completed

include \masm32\MasmBasic\MasmBasic.inc        ; download
uselib ntdll                ; MSDN

.data?
WorkSpace        db 4096 dup(?)        ; sized to the CompressBufferWorkSpaceSize returned below (4096)
CompSize        dd ?

  Init
  Let esi=FileRead$("\Masm32\include\Windows.inc")        ; sets LastFileSize
  Let edi=New$(LastFileSize)
  Let ebx=New$(LastFileSize)
  sub esp, 8                ; create two dword slots on stack
;  COMPRESSION_FORMAT_LZNT1=2        defined by Masm32, OK
  COMPRESSION_FORMAT_XPRESS=3        ; no luck under XP SP3
  COMPRESSION_FORMAT_XPRESS_HUFF=4        ; no luck under XP SP3
  ; STANDARD is much faster but the compression ratio is even more ridiculous
  FormatEngine=COMPRESSION_FORMAT_LZNT1 or COMPRESSION_ENGINE_MAXIMUM
  invoke RtlGetCompressionWorkSpaceSize, FormatEngine,
  esp, esp
  pop edx                        ; CompressBufferWorkSpaceSize, 4096
  pop ecx                        ; CompressFragmentWorkSpaceSize, 0
  .if eax!=STATUS_SUCCESS
        push eax
        Print "RtlGetCompressionWorkSpaceSize: ", Err$()
        pop ecx
        SWITCH ecx
        CASE STATUS_NOT_SUPPORTED
                PrintLine "STATUS_NOT_SUPPORTED"
        CASE STATUS_INVALID_PARAMETER
                PrintLine "STATUS_INVALID_PARAMETER"
        CASE STATUS_UNSUPPORTED_COMPRESSION
                PrintLine "STATUS_UNSUPPORTED_COMPRESSION"
        DEFAULT
                PrintLine "STATUS_UNKNOWN"
        ENDSW
  .else
        deb 4, "RtlGetCompressionWorkSpaceSize OK", edx, $Err$()
        invoke RtlCompressBuffer, FormatEngine,
          esi, LastFileSize,
          edi, LastFileSize,
          4096, addr CompSize, addr WorkSpace
        .if eax!=STATUS_SUCCESS                ; RtlCompressBuffer returns an NTSTATUS (numerically equal to S_OK)
                deb 4, "RtlCompressBuffer FAILED", $Err$()
        .else
                deb 4, "RtlCompressBuffer OK", $Err$()
                Open "O", #1, "compressed.zip"
                PrintBuffer #1, edi, CompSize
                Close #1
                deb 4, "Bytes written vs original size", CompSize, LastFileSize

                push eax                ; dword slot on stack for FinalUncompressedSize
                invoke SetLastError, 0
                invoke RtlDecompressBuffer, COMPRESSION_FORMAT_LZNT1,        ; MSDN: This parameter must be set to COMPRESSION_FORMAT_LZNT1
                  ebx, LastFileSize,
                  edi, LastFileSize,
                  esp
                pop ecx        ; FinalUncompressedSize
                deb 4, "FinalUncompressedSize", ecx, eax, $Err$()
                Open "O", #1, "decompressed.inc"
                PrintBuffer #1, ebx, ecx
                Close #1
        .endif
  .endif
  Inkey "ok?"
  Exit
end start
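For readers without ntdll at hand, the same round trip (compress, write, decompress, compare sizes) can be sketched in Python with zlib. The test data is a repetitive stand-in for Windows.inc, and DEFLATE replaces LZNT1, so only the shape of the pipeline carries over, not the timings or ratios:

```python
import zlib

def roundtrip(data: bytes) -> bytes:
    comp = zlib.compress(data, level=9)           # cf. COMPRESSION_ENGINE_MAXIMUM
    print("CompSize", len(comp), "vs LastFileSize", len(data))
    return zlib.decompress(comp)                  # cf. RtlDecompressBuffer

# repetitive stand-in text for \Masm32\include\Windows.inc
data = b"invoke RtlCompressBuffer\r\n" * 2000
assert roundtrip(data) == data                    # FinalUncompressedSize matches
```

DEFLATE's ratio on text like this is typically far better than LZNT1's, which matches the WinZip figure (193990 vs 285782 bytes) in the log above.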

frktons

Quote from: jj2007 on December 16, 2012, 09:48:28 AM

There is much better compression available, like FreeArc used in MasmBasic's ZipFiles, but the authors rarely offer (de)compression of a buffer.

I already read that, Jochen, and I remember it well, but I wonder what
prevents you from intercepting the buffer from which the routine you use
writes to the file. Of course I don't know how much code you had to write
to use FreeArc, but I imagine that if it can write to an output file, it
has to do so from a buffer.
FreeArc appears to be open source, so if you didn't implement the buffer
version, is it because you don't need it, or because [as the author of
MasmBasic] you don't want to offer decompression of a buffer?

P.S. I hope you don't feel like you're under interrogation; it's just a programmer's curiosity.  :P

There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

frktons

Quote from: jj2007 on December 16, 2012, 10:50:41 AM
I hacked together an example with my favourite test string, i.e. the contents of \Masm32\include\Windows.inc, using RtlCompressBuffer

The good news: It works.
The bad news: its performance is hilarious, both in terms of compression time and compression ratio (you're reading this in Redmond, aren't you?  :greensml:).

The really hilarious outcome would be failing to outperform it
after months of reinventing the wheel :lol:

frktons

Quote from: dedndave on December 16, 2012, 10:07:46 AM
something i tried long ago was a dictionary of words
each word then had an index number
you create the text by using a series of index numbers
i was surprised at how fast it was   :P
you can use the text for some words more than once
for example, the string "aren't" can be used for "a", "are", "aren't"
you can also use the "n't" to make other contractions like "doesn't", "don't", "isn't", etc
of course, if the index is longer than the string, the advantage dwindles a bit

This is one of the ideas I have in mind. But going from an idea to
working code is a completely new world. :lol:

I tried the FreeArc GUI version 0.666 [the latest so far] and you know what?
It compressed my 1202-byte text file with a worse compression ratio than the
Windows zipper. So I understand why, in this particular case, it is better to use
the Windows RtlCompressBuffer than any of the better algos around. For bigger strings it is
different. With my future algo? A few more months and I'll know.

dedndave

i would think the call to RtlGetCompressionWorkSpaceSize is unnecessary and slow
maybe you could just allocate an ample buffer and avoid using that function

although, the compression time is not a big issue, really
you are only going to compress them one time   :P

MichaelW

Steve,

I have both of the books. The second is about 550 pages. I have not done any close comparison of the two because I use only the newer one, but they both cover roughly the same material.

Frank,

If you will be compressing the data once and expanding it many times, then I think arithmetic coding would be a good choice. For arithmetic coding the compression rates are low (or very low compared to Huffman and the LZW variants) but the expansion rates for the higher-order models are very high.
Well Microsoft, here's another nice mess you've gotten us into.
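Michael's point about arithmetic coding can be illustrated with a toy order-0 coder. Exact `Fraction` arithmetic stands in for the fixed-point renormalization a real coder would use, and the function names are made up for this sketch, so it shows the principle only, not a practical implementation:

```python
from collections import Counter
from fractions import Fraction

def build_model(text):
    # Cumulative probability interval [lo, hi) per symbol, as exact fractions
    counts = Counter(text)
    lo, model = Fraction(0), {}
    for sym in sorted(counts):
        p = Fraction(counts[sym], len(text))
        model[sym] = (lo, lo + p)
        lo += p
    return model

def ac_encode(text, model):
    lo, hi = Fraction(0), Fraction(1)
    for sym in text:                    # narrow the interval symbol by symbol
        span = hi - lo
        s_lo, s_hi = model[sym]
        lo, hi = lo + span * s_lo, lo + span * s_hi
    return (lo + hi) / 2                # any number inside [lo, hi) identifies the text

def ac_decode(code, model, length):
    out, lo, hi = [], Fraction(0), Fraction(1)
    for _ in range(length):
        target = (code - lo) / (hi - lo)
        for sym, (s_lo, s_hi) in model.items():
            if s_lo <= target < s_hi:   # which symbol's interval contains the code?
                span = hi - lo
                lo, hi = lo + span * s_lo, lo + span * s_hi
                out.append(sym)
                break
    return "".join(out)
```

Note how decoding is just repeated interval lookups, which is why the expansion side can be made fast, as Michael says.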

frktons

Quote from: dedndave on December 16, 2012, 12:53:33 PM
i would think the call to RtlGetCompressionWorkSpaceSize is unnecessary and slow
maybe you could just allocate an ample buffer and avoid using that function

although, the compression time is not a big issue, really
you are only going to compress them one time   :P

You are right, Master. The idea is to have the data already compressed
inside the program. But somehow (if I don't use RtlCompressBuffer) I have
to invent a customized algo to shrink/unpack the data with my own
routines as well.

Quote from: MichaelW on December 16, 2012, 01:10:36 PM

Frank,

If you will be compressing the data once and expanding it many times, then I think arithmetic coding would be a good choice. For arithmetic coding the compression rates are low (or very low compared to Huffman and the LZW variants) but the expansion rates for the higher-order models are very high.


Thanks Michael, that's something to keep in mind.  :t

dedndave

the decompress code is fairly easy compared to the compress code   :P
i have some ideas on this, but i have another project going that i want to finish before i pick up a new one

frktons

Quote from: dedndave on December 16, 2012, 01:26:55 PM
the decompress code is fairly easy compared to the compress code   :P
i have some ideas on this, but i have another project going that i want to finish before i pick up a new one

This project could take months to complete... you'll probably find me
here with a newly reinvented wheel under construction when you're back.

My overall idea for compressing data:

Quote
1. Scan the text buffer to shrink and count how many:

    a. ASCII codes are used
    b. 0-9 ASCII codes
    c. A-Z ASCII codes
    d. a-z ASCII codes
    e. spaces
    f. CRLF pairs
    g. special chars [,.:-_[]()!£$%&/|\?'^*+@#....]
    h. tokens of n bytes in size
    i. occurrences of duplicated tokens for each size found
    j. and other things that turn out to be necessary

2. Estimate how much the text could be shrunk, considering that:

    a. all spaces can be removed [a flag will tell the proc when to reinsert one or more of them]
    b. CRLF can be coded as a 1-byte token
    c. repeated tokens can be replaced with an index [1-2 bytes max] into the token array(s)
    d. repeated chars ["------------"] can be coded with appropriate instructions
    e. uppercase can be temporarily converted to lowercase
    f. duplicated patterns can be coded conveniently
    g. the most frequent ASCII codes can be coded in 3-4 bits to save bytes
    h. and so on

3. Find a convenient way, both in speed and size, to accomplish the task.

4. Test many different routines and choose whichever seems to suit the
     task best.

5.  Pay $50 to Dave
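Step 1 of the plan above can be sketched as a simple scanner. The token size, the chosen character classes, and the `scan` name are arbitrary illustrations, not a fixed design:

```python
from collections import Counter

def scan(text: str, token_size: int = 4):
    # Gather the counts listed in step 1: char classes plus duplicated tokens
    stats = {
        "distinct_codes": len(set(text)),
        "digits":  sum(c.isdigit() for c in text),
        "upper":   sum("A" <= c <= "Z" for c in text),
        "lower":   sum("a" <= c <= "z" for c in text),
        "spaces":  text.count(" "),
        "crlf":    text.count("\r\n"),
    }
    # Sliding window of token_size bytes; count tokens that occur more than once
    tokens = Counter(text[i:i + token_size]
                     for i in range(len(text) - token_size + 1))
    stats["dup_tokens"] = sum(1 for n in tokens.values() if n > 1)
    return stats
```

These counts feed directly into the step-2 estimate: every duplicated token is a candidate for a 1-2 byte index, every space and CRLF for a flag or 1-byte code.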

jj2007

Quote from: frktons on December 16, 2012, 12:01:04 PM
what prevents you from intercepting the buffer from which you write to the file

But that's the point: I have to write it to a file, because FreeArc needs a file, not a pointer to a buffer.

Quote from: dedndave on December 16, 2012, 12:53:33 PM
i would think the call to RtlGetCompressionWorkSpaceSize is unnecessary and slow

I timed it, Dave ;-)
RtlGetCompressionWorkSpaceSize took 0 ms
RtlCompressBuffer took 2393 ms

Donkey

If I were to write a simple codec for string files, I would begin by reducing every ASCII character to 7 bits, with 0000000 as a flag character for non-standard cases. This is of course lossless, as ASCII is only 7 bits anyway and the loss of the zero character is not significant. For repeats, any run of 3 or more identical characters can be replaced by 0000000:nRepeats:char, where nRepeats would be 6 bits preceded by a 0 (allowing up to 63 repeats) and char would be 7 bits. By reserving the most significant bit of nRepeats you can add special cases: turn the bit on and use the least significant 6 bits to specify the operation to be performed.

Note that this only works on text files using ASCII, not on ANSI (Windows-1252) files.

This is just an idea off the top of my head for a simple text compression algorithm; it wouldn't approach the compression rates of any modern algorithm, but it should be pretty darn fast and quick and easy to write.

Edgar
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable
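Donkey's scheme can be sketched in Python. For clarity the bit stream is modelled as a string of '0'/'1' characters rather than packed bytes, and the function names are invented for this sketch; it relies on his assumption that the input is 7-bit ASCII without NUL, since a literal NUL would collide with the escape:

```python
def encode7(text: str) -> str:
    """7 bits per ASCII char; a run of 3..63 identical chars becomes
    escape 0000000 + flag bit 0 + 6-bit count + 7-bit char."""
    bits, i = [], 0
    while i < len(text):
        c, run = text[i], 1
        while i + run < len(text) and text[i + run] == c and run < 63:
            run += 1
        if run >= 3:
            bits.append("0000000" + "0" + format(run, "06b")
                        + format(ord(c), "07b"))
        else:
            bits.append(format(ord(c), "07b") * run)   # 1 or 2 literals
        i += run
    return "".join(bits)

def decode7(bits: str) -> str:
    out, i = [], 0
    while i < len(bits):
        sym = bits[i:i + 7]; i += 7
        if sym == "0000000":                           # escape sequence
            run = int(bits[i + 1:i + 7], 2)            # skip flag bit, read count
            out.append(chr(int(bits[i + 7:i + 14], 2)) * run)
            i += 14
        else:
            out.append(chr(int(sym, 2)))
    return "".join(out)
```

A packed version would simply write these bits 8 at a time; even without run-length escapes the 7-bit reduction alone saves 12.5%, and runs like "----------" collapse from 70 bits to 21.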

frktons

Quote from: jj2007 on December 16, 2012, 02:00:44 PM

But that's the point: I have to write it to a file, because FreeArc needs a file, not a pointer to a buffer.


Do you speak C/C++? Most probably yes. And you have the source.
In my experience this only means that you don't need, or don't like,
to modify the 2-3 [hundred] instructions needed to implement it.

As they say: we can. We can write it from scratch, modify it, tweak it,
bend it to our needs, if we really want to.

Anyway, you have your reasons, and I respect them. You've already done
an impressive amount of work to build MasmBasic, so far be it from me
to suggest you should do some extra work.  ;)

Quote from: Donkey on December 16, 2012, 02:15:47 PM
If I were to write a simple codec for string files, I would begin by reducing every ASCII character to 7 bits, with 0000000 as a flag character for non-standard cases. This is of course lossless, as ASCII is only 7 bits anyway and the loss of the zero character is not significant. For repeats, any run of 3 or more identical characters can be replaced by 0000000:nRepeats:char, where nRepeats would be 6 bits preceded by a 0 (allowing up to 63 repeats) and char would be 7 bits. By reserving the most significant bit of nRepeats you can add special cases: turn the bit on and use the least significant 6 bits to specify the operation to be performed.

Note that this only works on text files using ASCII, not on ANSI (Windows-1252) files.

This is just an idea off the top of my head for a simple text compression algorithm; it wouldn't approach the compression rates of any modern algorithm, but it should be pretty darn fast and quick and easy to write.

Edgar

Thanks Edgar, a nice idea to meditate upon.  :t
Unfortunately I'm considering files with ANSI codes, so I have to use
all 8 bits to represent some special chars.

It could happen, however, that some strings don't need those special chars,
and in such cases your idea could be implemented.

jj2007

Quote from: frktons on December 16, 2012, 02:21:54 PM
Quote from: jj2007 on December 16, 2012, 02:00:44 PM
But that's the point: I have to write it to a file, because FreeArc needs a file, not a pointer to a buffer.

Do you speak C/C++? Most probably yes. You have the source.

Yes, the source is available here. If you can convince your C compiler to translate LZMA\C\Util\Lzma\LzmaUtil.c into a DLL or LIB, then I will offer you the CompressBuffer function.

My Visual Studio 2010 refuses to convert any "project" older than two years or so, sometimes with loads of cryptic error messages, sometimes with a long silence. Pelles C utters "internal errors" without specifying what's wrong, so it's the usual C mess. But try your luck.

jj2007

Quote from: Donkey on December 16, 2012, 02:15:47 PMFor repeat characters any 3 repeated characters or greater can be replaced by 0000000:nRepeats:char where nRepeats would be 6 bits preceded by 0 (allowing up to 63 repeats) and char would be 7 bits.

Go a step further and set the rule "everything above 127 is an index into the dictionary of frequently used words"

So the three-byte string chr$(3+128, 32, 7+128) becomes <invoke lstrcpy>, 14 bytes

1 mov
2 add
3 invoke
4 proc
5 endp
6 call
7 lstrcpy

This is more or less the approach of archivers nowadays - find long frequently used byte sequences.
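jj2007's rule decodes in a few lines. The dictionary below comes from his numbered list (1-based, so slot 0 is left unused as a hypothetical choice), and the `expand` name is invented for this sketch:

```python
# Index numbers from jj2007's list above; slot 0 unused so byte 128+n maps to word n
WORDS = ["", "mov", "add", "invoke", "proc", "endp", "call", "lstrcpy"]

def expand(data: bytes) -> str:
    out = []
    for b in data:
        # rule: everything above 127 is an index into the word dictionary
        out.append(WORDS[b - 128] if b > 127 else chr(b))
    return "".join(out)
```

So the three bytes chr$(3+128, 32, 7+128) expand to "invoke lstrcpy", 14 bytes, exactly as in the post.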