How difficult is it to build a shrinking/deflating routine?

Started by frktons, December 15, 2012, 10:58:49 AM


dedndave

something i tried long ago was a dictionary of words
each word then had an index number
you create the text by using a series of index numbers
i was surprised at how fast it was   :P
you can use the text for some words more than once
for example, the string "aren't" can be used for "a", "are", "aren't"
you can also use the "n't" to make other contractions like "doesn't", "don't", "isn't", etc
of course, if the index is longer than the string, the advantage dwindles a bit
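dedndave's scheme can be sketched in a few lines of Python. The `POOL`/`DICT` contents and the literal fallback are illustrative assumptions; the point is that one stored string ("doesn't" plus "aren't") serves several dictionary words via (offset, length) slices:

```python
# One shared text pool; each dictionary word is an (offset, length) slice,
# so the stored text is reused for several words (dedndave's trick).
POOL = "doesn'taren'tis"
DICT = {
    "does":    (0, 4),
    "doesn't": (0, 7),
    "n't":     (4, 3),
    "a":       (7, 1),
    "are":     (7, 3),
    "aren't":  (7, 6),
    "is":      (13, 2),
}
INDEX = list(DICT)                          # index number -> word

def compress(text):
    out = []
    for word in text.split():
        if word in DICT:
            out.append(INDEX.index(word))   # small index number
        else:
            out.append(word)                # literal fallback for unknown words
    return out

def decompress(tokens):
    words = []
    for t in tokens:
        if isinstance(t, int):
            off, n = DICT[INDEX[t]]
            words.append(POOL[off:off + n]) # slice the shared pool
        else:
            words.append(t)
    return " ".join(words)
```

As dedndave notes, the gain depends on the index being shorter than the word it replaces; a real implementation would use fixed-width byte indices rather than Python objects.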

jj2007

I hacked together an example with my favourite test string, i.e. the contents of \Masm32\include\Windows.inc, using RtlCompressBuffer

The good news: It works.
The bad news: its performance is hilarious, both in terms of compression time and compression ratio (you're reading this in Redmond, aren't you?  :greensml:).

RtlGetCompressionWorkSpaceSize OK
edx             4096
$Err$()         Operation completed

RtlCompressBuffer OK    $Err$()         Operation completed

Bytes written vs original size
CompSize        285782   ; << for comparison: WinZip 193990, FreeArc 146200 bytes
LastFileSize    977412

FinalUncompressedSize
ecx             977412
eax             0
$Err$()         Operation completed

include \masm32\MasmBasic\MasmBasic.inc        ; download
uselib ntdll                ; MSDN

.data?
WorkSpace        db 4096 dup(?)        ; sized to the CompressBufferWorkSpaceSize returned below (4096)
CompSize        dd ?

  Init
  Let esi=FileRead$("\Masm32\include\Windows.inc")        ; sets LastFileSize
  Let edi=New$(LastFileSize)
  Let ebx=New$(LastFileSize)
  sub esp, 8                ; create two dword slots on stack
;  COMPRESSION_FORMAT_LZNT1=2        defined by Masm32, OK
  COMPRESSION_FORMAT_XPRESS=3        ; no luck under XP SP3
  COMPRESSION_FORMAT_XPRESS_HUFF=4        ; no luck under XP SP3
  ; STANDARD is much faster but the compression ratio is even more ridiculous
  FormatEngine=COMPRESSION_FORMAT_LZNT1 or COMPRESSION_ENGINE_MAXIMUM
  invoke RtlGetCompressionWorkSpaceSize, FormatEngine,
  esp, esp
  pop edx                        ; CompressBufferWorkSpaceSize, 4096
  pop ecx                        ; CompressFragmentWorkSpaceSize, 0
  .if eax!=STATUS_SUCCESS
        push eax
        Print "RtlGetCompressionWorkSpaceSize: ", Err$()
        pop ecx
        SWITCH ecx
        CASE STATUS_NOT_SUPPORTED
                PrintLine "STATUS_NOT_SUPPORTED"
        CASE STATUS_INVALID_PARAMETER
                PrintLine "STATUS_INVALID_PARAMETER"
        CASE STATUS_UNSUPPORTED_COMPRESSION
                PrintLine "STATUS_UNSUPPORTED_COMPRESSION"
        DEFAULT
                PrintLine "STATUS_UNKNOWN"
        ENDSW
  .else
        deb 4, "RtlGetCompressionWorkSpaceSize OK", edx, $Err$()
        invoke RtlCompressBuffer, FormatEngine,
          esi, LastFileSize,
          edi, LastFileSize,
          4096, addr CompSize, addr WorkSpace
        .if eax!=STATUS_SUCCESS                ; RtlCompressBuffer returns an NTSTATUS (numerically equal to S_OK)
                deb 4, "RtlCompressBuffer FAILED", $Err$()
        .else
                deb 4, "RtlCompressBuffer OK", $Err$()
                Open "O", #1, "compressed.zip"
                PrintBuffer #1, edi, CompSize
                Close #1
                deb 4, "Bytes written vs original size", CompSize, LastFileSize

                push eax                ; dword slot on stack for FinalUncompressedSize
                invoke SetLastError, 0
                invoke RtlDecompressBuffer, COMPRESSION_FORMAT_LZNT1,        ; MSDN: This parameter must be set to COMPRESSION_FORMAT_LZNT1
                  ebx, LastFileSize,
                  edi, LastFileSize,
                  esp
                pop ecx        ; FinalUncompressedSize
                deb 4, "FinalUncompressedSize", ecx, eax, $Err$()
                Open "O", #1, "decompressed.inc"
                PrintBuffer #1, ebx, ecx
                Close #1
        .endif
  .endif
  Inkey "ok?"
  Exit
end start
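For readers without ntdll at hand, the same round trip (compress, write, decompress, compare sizes) can be sketched in Python with zlib. The test data is a repetitive stand-in for Windows.inc, and DEFLATE replaces LZNT1, so only the shape of the pipeline carries over, not the timings or ratios:

```python
import zlib

def roundtrip(data: bytes) -> bytes:
    comp = zlib.compress(data, level=9)           # cf. COMPRESSION_ENGINE_MAXIMUM
    print("CompSize", len(comp), "vs LastFileSize", len(data))
    return zlib.decompress(comp)                  # cf. RtlDecompressBuffer

# repetitive stand-in text for \Masm32\include\Windows.inc
data = b"invoke RtlCompressBuffer\r\n" * 2000
assert roundtrip(data) == data                    # FinalUncompressedSize matches
```

DEFLATE's ratio on text like this is typically far better than LZNT1's, which matches the WinZip figure (193990 vs 285782 bytes) in the log above.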

frktons

Quote from: jj2007 on December 16, 2012, 09:48:28 AM

There is much better compression available, like FreeArc used in MasmBasic's ZipFiles, but the authors rarely offer (de)compression of a buffer.

I already read that, Jochen, and I remember it well, but I wonder what
prevents you from intercepting the buffer from which the routine you use
writes to the file. Of course I don't know how much code you had to write
to use FreeArc, but I imagine that if it can write to an output file, it
has to do so from a buffer.
FreeArc appears to be open source, so if you didn't implement the buffer
version, is it because you don't need it, or because [as the author of
MasmBasic] you don't want to offer decompression of a buffer?

P.S. I hope you don't feel like you're under interrogation; it's just a programmer's curiosity.  :P

There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

frktons

Quote from: jj2007 on December 16, 2012, 10:50:41 AM
I hacked together an example with my favourite test string, i.e. the contents of \Masm32\include\Windows.inc, using RtlCompressBuffer

The good news: It works.
The bad news: its performance is hilarious, both in terms of compression time and compression ratio (you're reading this in Redmond, aren't you?  :greensml:).

The really hilarious outcome would be failing to outperform it
after months of reinventing the wheel :lol:

frktons

Quote from: dedndave on December 16, 2012, 10:07:46 AM
something i tried long ago was a dictionary of words
each word then had an index number
you create the text by using a series of index numbers
i was surprised at how fast it was   :P
you can use the text for some words more than once
for example, the string "aren't" can be used for "a", "are", "aren't"
you can also use the "n't" to make other contractions like "doesn't", "don't", "isn't", etc
of course, if the index is longer than the string, the advantage dwindles a bit

This is one of the ideas I have in mind. But going from an idea to
working code is a completely new world. :lol:

I tried the FreeArc GUI version 0.666 [the latest so far] and you know what?
It compressed my 1202-byte text file with a worse compression ratio than the
Windows zipper. So I understand why, in this particular case, it is better to use
the Windows RtlCompressBuffer than any of the better algos around. For bigger strings it is
different. With my future algo? A few more months and I'll know.

dedndave

i would think the call to RtlGetCompressionWorkSpaceSize is unnecessary and slow
maybe you could just allocate an ample buffer and avoid using that function

although, the compression time is not a big issue, really
you are only going to compress them one time   :P

MichaelW

Steve,

I have both of the books. The second is about 550 pages. I have not done any close comparison of the two because I use only the newer one, but they both cover roughly the same material.

Frank,

If you will be compressing the data once and expanding it many times, then I think arithmetic coding would be a good choice. For arithmetic coding the compression rates are low (or very low compared to Huffman and the LZW variants) but the expansion rates for the higher-order models are very high.
Well Microsoft, here's another nice mess you've gotten us into.
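Michael's point about arithmetic coding can be illustrated with a toy order-0 coder. Exact `Fraction` arithmetic stands in for the fixed-point renormalization a real coder would use, and the function names are made up for this sketch, so it shows the principle only, not a practical implementation:

```python
from collections import Counter
from fractions import Fraction

def build_model(text):
    # Cumulative probability interval [lo, hi) per symbol, as exact fractions
    counts = Counter(text)
    lo, model = Fraction(0), {}
    for sym in sorted(counts):
        p = Fraction(counts[sym], len(text))
        model[sym] = (lo, lo + p)
        lo += p
    return model

def ac_encode(text, model):
    lo, hi = Fraction(0), Fraction(1)
    for sym in text:                    # narrow the interval symbol by symbol
        span = hi - lo
        s_lo, s_hi = model[sym]
        lo, hi = lo + span * s_lo, lo + span * s_hi
    return (lo + hi) / 2                # any number inside [lo, hi) identifies the text

def ac_decode(code, model, length):
    out, lo, hi = [], Fraction(0), Fraction(1)
    for _ in range(length):
        target = (code - lo) / (hi - lo)
        for sym, (s_lo, s_hi) in model.items():
            if s_lo <= target < s_hi:   # which symbol's interval contains the code?
                span = hi - lo
                lo, hi = lo + span * s_lo, lo + span * s_hi
                out.append(sym)
                break
    return "".join(out)
```

Note how decoding is just repeated interval lookups, which is why the expansion side can be made fast, as Michael says.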

frktons

Quote from: dedndave on December 16, 2012, 12:53:33 PM
i would think the call to RtlGetCompressionWorkSpaceSize is unnecessary and slow
maybe you could just allocate an ample buffer and avoid using that function

although, the compression time is not a big issue, really
you are only going to compress them one time   :P

You are right, Master. The idea is to have the data already compressed
inside the program. But somehow (if I don't use RtlCompressBuffer) I have
to invent a customized algo to shrink/unpack the data with my own
routines as well.

Quote from: MichaelW on December 16, 2012, 01:10:36 PM

Frank,

If you will be compressing the data once and expanding it many times, then I think arithmetic coding would be a good choice. For arithmetic coding the compression rates are low (or very low compared to Huffman and the LZW variants) but the expansion rates for the higher-order models are very high.


Thanks Michael, that's something to keep in mind.  :t

dedndave

the decompress code is fairly easy compared to the compress code   :P
i have some ideas on this, but i have another project going that i want to finish before i pick up a new one

frktons

Quote from: dedndave on December 16, 2012, 01:26:55 PM
the decompress code is fairly easy compared to the compress code   :P
i have some ideas on this, but i have another project going that i want to finish before i pick up a new one

This project could take months to complete... you'll probably find me
here with a newly reinvented wheel under construction when you're back.

My overall idea for compressing data:

Quote
1. Scan the text buffer to shrink and count how many:

    a. ASCII codes are used
    b. 0-9 ASCII codes
    c. A-Z ASCII codes
    d. a-z ASCII codes
    e. spaces
    f. CRLF pairs
    g. special chars [,.:-_[]()!£$%&/|\?'^*+@#....]
    h. tokens of n bytes in size
    i. occurrences of duplicated tokens for each size found
    j. and other things that turn out to be necessary

2. Estimate how much the text could be shrunk, considering that:

    a. all spaces can be removed [a flag will tell the proc when to reinsert one or more of them]
    b. CRLF can be coded as a 1-byte token
    c. repeated tokens can be replaced with an index [1-2 bytes max] into the token array(s)
    d. repeated chars ["------------"] can be coded with appropriate instructions
    e. uppercase can be temporarily converted to lowercase
    f. duplicated patterns can be coded conveniently
    g. the most frequent ASCII codes can be coded in 3-4 bits to save bytes
    h. and so on

3. Find a convenient way, both in speed and size, to accomplish the task.

4. Test many different routines and choose whichever seems to suit the
     task best.

5.  Pay $50 to Dave
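Step 1 of the plan above can be sketched as a simple scanner. The token size, the chosen character classes, and the `scan` name are arbitrary illustrations, not a fixed design:

```python
from collections import Counter

def scan(text: str, token_size: int = 4):
    # Gather the counts listed in step 1: char classes plus duplicated tokens
    stats = {
        "distinct_codes": len(set(text)),
        "digits":  sum(c.isdigit() for c in text),
        "upper":   sum("A" <= c <= "Z" for c in text),
        "lower":   sum("a" <= c <= "z" for c in text),
        "spaces":  text.count(" "),
        "crlf":    text.count("\r\n"),
    }
    # Sliding window of token_size bytes; count tokens that occur more than once
    tokens = Counter(text[i:i + token_size]
                     for i in range(len(text) - token_size + 1))
    stats["dup_tokens"] = sum(1 for n in tokens.values() if n > 1)
    return stats
```

These counts feed directly into the step-2 estimate: every duplicated token is a candidate for a 1-2 byte index, every space and CRLF for a flag or 1-byte code.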

jj2007

Quote from: frktons on December 16, 2012, 12:01:04 PM
what prevents you from intercepting the buffer from which you write to the file

But that's the point: I have to write it to a file, because FreeArc needs a file, not a pointer to a buffer.

Quote from: dedndave on December 16, 2012, 12:53:33 PM
i would think the call to RtlGetCompressionWorkSpaceSize is unnecessary and slow

I timed it, Dave ;-)
RtlGetCompressionWorkSpaceSize took 0 ms
RtlCompressBuffer took 2393 ms

Donkey

If I were to write a simple codec for string files, I would begin by reducing every ASCII character to 7 bits, with 0000000 as a flag character for non-standard cases. This is of course lossless, as ASCII is only 7 bits anyway and the loss of the zero character is not significant. For repeats, any run of 3 or more identical characters can be replaced by 0000000:nRepeats:char, where nRepeats would be 6 bits preceded by a 0 (allowing up to 63 repeats) and char would be 7 bits. By reserving the most significant bit of nRepeats you can add special cases: turn the bit on and use the least significant 6 bits to specify the operation to be performed.

Note that this only works on text files using ASCII, not on ANSI (Windows-1252) files.

This is just an idea off the top of my head for a simple text compression algorithm; it wouldn't approach the compression rates of any modern algorithm, but it should be pretty darn fast and quick and easy to write.

Edgar
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable
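Donkey's scheme can be sketched in Python. For clarity the bit stream is modelled as a string of '0'/'1' characters rather than packed bytes, and the function names are invented for this sketch; it relies on his assumption that the input is 7-bit ASCII without NUL, since a literal NUL would collide with the escape:

```python
def encode7(text: str) -> str:
    """7 bits per ASCII char; a run of 3..63 identical chars becomes
    escape 0000000 + flag bit 0 + 6-bit count + 7-bit char."""
    bits, i = [], 0
    while i < len(text):
        c, run = text[i], 1
        while i + run < len(text) and text[i + run] == c and run < 63:
            run += 1
        if run >= 3:
            bits.append("0000000" + "0" + format(run, "06b")
                        + format(ord(c), "07b"))
        else:
            bits.append(format(ord(c), "07b") * run)   # 1 or 2 literals
        i += run
    return "".join(bits)

def decode7(bits: str) -> str:
    out, i = [], 0
    while i < len(bits):
        sym = bits[i:i + 7]; i += 7
        if sym == "0000000":                           # escape sequence
            run = int(bits[i + 1:i + 7], 2)            # skip flag bit, read count
            out.append(chr(int(bits[i + 7:i + 14], 2)) * run)
            i += 14
        else:
            out.append(chr(int(sym, 2)))
    return "".join(out)
```

A packed version would simply write these bits 8 at a time; even without run-length escapes the 7-bit reduction alone saves 12.5%, and runs like "----------" collapse from 70 bits to 21.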

frktons

Quote from: jj2007 on December 16, 2012, 02:00:44 PM

But that's the point: I have to write it to a file, because FreeArc needs a file, not a pointer to a buffer.


Do you speak C/C++? Most probably yes. And you have the source.
In my experience this only means that you don't need, or don't like,
to modify the 2-3 [hundred] instructions needed to implement it.

As they say: we can. We can write it from scratch, modify it, tweak it,
bend it to our needs, if we really want to.

Anyway, you have your reasons, and I respect them. You've already done
an impressive amount of work to build MasmBasic, so far be it from me
to suggest you should do some extra work.  ;)

Quote from: Donkey on December 16, 2012, 02:15:47 PM
If I were to write a simple codec for string files, I would begin by reducing every ASCII character to 7 bits, with 0000000 as a flag character for non-standard cases. This is of course lossless, as ASCII is only 7 bits anyway and the loss of the zero character is not significant. For repeats, any run of 3 or more identical characters can be replaced by 0000000:nRepeats:char, where nRepeats would be 6 bits preceded by a 0 (allowing up to 63 repeats) and char would be 7 bits. By reserving the most significant bit of nRepeats you can add special cases: turn the bit on and use the least significant 6 bits to specify the operation to be performed.

Note that this only works on text files using ASCII, not on ANSI (Windows-1252) files.

This is just an idea off the top of my head for a simple text compression algorithm; it wouldn't approach the compression rates of any modern algorithm, but it should be pretty darn fast and quick and easy to write.

Edgar

Thanks Edgar, a nice idea to meditate upon.  :t
Unfortunately I'm considering files with ANSI codes, so I have to use
all 8 bits to represent some special chars.

It could happen, however, that some strings don't need those special chars,
and in such cases your idea could be implemented.

jj2007

Quote from: frktons on December 16, 2012, 02:21:54 PM
Quote from: jj2007 on December 16, 2012, 02:00:44 PM
But that's the point: I have to write it to a file, because FreeArc needs a file, not a pointer to a buffer.

Do you speak C/C++? Most probably yes. You have the source.

Yes, the source is available here. If you can convince your C compiler to translate LZMA\C\Util\Lzma\LzmaUtil.c into a DLL or LIB, then I will offer you the CompressBuffer function.

My Visual Studio 2010 refuses to convert any "project" older than two years or so, sometimes with loads of cryptic error messages, sometimes with a long silence. Pelles C utters "internal errors" without specifying what's wrong, so it's the usual C mess. But try your luck.

jj2007

Quote from: Donkey on December 16, 2012, 02:15:47 PMFor repeat characters any 3 repeated characters or greater can be replaced by 0000000:nRepeats:char where nRepeats would be 6 bits preceded by 0 (allowing up to 63 repeats) and char would be 7 bits.

Go a step further and set the rule "everything above 127 is an index into the dictionary of frequently used words"

So the three-byte string chr$(3+128, 32, 7+128) becomes <invoke lstrcpy>, 14 bytes

1 mov
2 add
3 invoke
4 proc
5 endp
6 call
7 lstrcpy

This is more or less the approach of archivers nowadays - find long frequently used byte sequences.
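jj2007's rule decodes in a few lines. The dictionary below comes from his numbered list (1-based, so slot 0 is left unused as a hypothetical choice), and the `expand` name is invented for this sketch:

```python
# Index numbers from jj2007's list above; slot 0 unused so byte 128+n maps to word n
WORDS = ["", "mov", "add", "invoke", "proc", "endp", "call", "lstrcpy"]

def expand(data: bytes) -> str:
    out = []
    for b in data:
        # rule: everything above 127 is an index into the word dictionary
        out.append(WORDS[b - 128] if b > 127 else chr(b))
    return "".join(out)
```

So the three bytes chr$(3+128, 32, 7+128) expand to "invoke lstrcpy", 14 bytes, exactly as in the post.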