The shadow of a binary tree

FORTRANS · February 08, 2013, 02:04:23 AM

Hi,


; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hash_rec	STRUC
HCode   DW      ?       ; Code of the symbol in slot i, or 0 if empty slot.
HPrefix DW      ?       ; Symbol's prefix code; undefined if empty slot.
HSuffix DB      ?       ; Symbol Suffix character.
Hash_rec        ENDS

HSize   EQU     5120    ; Hash table size for ~80% occupancy.  Not prime
                        ; As AoCP wants.

LZWHash Hash_rec  HSize DUP(<>) ; syntax for a structure.

; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; 27 March 2008, make a hash to index the LZW token array.  Use logic
; similar to Unix COMPRESS code.  Make a separate routine for cleaner
; code, and possibility to use a different algorithm later on.
INDEX:
        PUSH    BX      ; Not sure if this is necessary...
        PUSH    CX      ; But protect yourself while debugging.

        MOV     SI,AX   ; Move character into Source Index register.
        MOV     CX,4
        SHL     SI,CL   ; Shift to make comparable to table size.
        ADD     SI,[Prefix] ; Make hash dependent on both prefix and character.
        CMP     SI,HSize ; Too big?
        JB      Idx1
        SUB     SI,HSize ; Yep.
Idx1:
; Multiply hash number by record size.
        MOV     BX,SI   ; Five bytes per table entry.
        SHL     SI,1
        SHL     SI,1
        ADD     SI,BX
        ADD     SI,OFFSET LZWHash       ; And point into the table

        POP     CX
        POP     BX

        RET

frktons · February 08, 2013, 04:36:21 AM

Quote from: FORTRANS on February 08, 2013, 02:04:23 AM
Hi,

Code Select Expand
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hash_rec STRUC HCode DW ? ; Code of the symbol in slot i, or 0 if empty slot. HPrefix DW ? ; Symbol's prefix code; undefined if empty slot. HSuffix DB ? ; Symbol Suffix character. Hash_rec ENDS HSize EQU 5120 ; Hash table size for ~80% occupancy. Not prime ; As AoCP wants. LZWHash Hash_rec HSize DUP(<>) ; syntax for a structure. ; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ; 27 March 2008, make a hash to index the LZW token array. Use logic ; similar to Unix COMPRESS code. Make a separate routine for cleaner ; code, and possibility to use a different algorithm later on. INDEX: PUSH BX ; Not sure if this is necessary... PUSH CX ; But protect yourself while debugging. MOV SI,AX ; Move character into Source Index register. MOV CX,4 SHL SI,CL ; Shift to make comparable to table size. ADD SI,[Prefix] ; Make hash dependent on both prefix and character. CMP SI,HSize ; Too big? JB Idx1 SUB SI,HSize ; Yep. Idx1: ; Multiply hash number by record size. MOV BX,SI ; Five bytes per table entry. SHL SI,1 SHL SI,1 ADD SI,BX ADD SI,OFFSET LZWHash ; And point into the table POP CX POP BX RET

Adapting it to 32 bit code should not be difficult.
It is just a matter of replacing 16 bit regs with their 32 bit counterpart,
or there is some trick to be aware of?

Thanks anyway

FORTRANS · February 08, 2013, 05:49:53 AM

Hi,

Quote from: frktons on February 08, 2013, 04:36:21 AM
Adapting it to 32 bit code should not be difficult.
It is just a matter of replacing 16 bit regs with their 32 bit counterpart,
or there is some trick to be aware of?

No tricks really. The operations here only use 13-bits until
you add in the offset to the table. Of course GIF files were
defined to be easy to create, often on 8-bit computers.
When going to 32-bits you have a lot more resources available.
If your hash table is much larger, you may want to use a more
sophisticated hash algorithm as this one can take a little while
to "warm up" in a few situations. It does not matter in this
application, as it will still work due to the successful processing
of collisions. And in practice, it seems to work well.

Note that the collision processing generally requires a second
hash function to look for an unused table entery. But in my
case a really simple one works. Brute force. If you get a really
large hash table that is heavily used you might want to read up
on the theory in Knuth's book or similar.

As you can see in the code comments, I thought I might need
a better hash function as this one looks simple. But, in testing
it, it worked well.

Regards,

Steve N.

frktons · February 08, 2013, 05:56:27 AM

Quote from: FORTRANS on February 08, 2013, 05:49:53 AM

As you can see in the code comments, I thought I might need
a better hash function as this one looks simple. But, in testing
it, it worked well.

Regards,

Steve N.

Steve, did you test it to see how many hashes per second it generates?
Only testing performance you can decide if it is fast enough or you could
better implement a new one.

Frank

FORTRANS · February 08, 2013, 07:32:05 AM

Hi Frank,

No. I tested for evidence of good or bad behavior. I looked
to see how the table filled up. If it filled one part too much, it
would have too many collisions. The entries were spread out
fairly well. I just looked at the table using DEBUG after letting
it run for a while. If it had looked bad, I would either generate
statistics on the collisions or just use some other hash function
suggested by Knuth or Nelson or smack this one till it looked
okay. Hm, statistics? Might be fun to look at.

Generating the hash table index is so small compared
to the rest of the compression code that speed was never a
question. The other algorithims I looked at use either a divide
or a multiply or both. Those would have to be slower than an
add. (At least on my older computers.) The program is fast
enough on my computers that I use for programming. Though
the rotate in this one might hurt a bit. Just a quick check to
see a rotate is better than a multiply on older cpus.

Cheers,

Steve N.

FORTRANS · February 09, 2013, 07:34:08 AM

Hi,

Should have said shift rather than rotate above.

Well I put some code in to see how many collisions occur.
The first test got 16088 collisions for 100318 indicies or 16%.
A second test was run to test what I hoped would be an
improvement in the algorithm. The number of collisions was
5683 with the newer version and was 2656 with the posted
version, the number of indicies was 801773, so 0.7% and
0.3%. The major difference is the first test was for a 256
color file and the second was 16 color.

Cheers,

Steve N.

frktons · February 09, 2013, 10:24:59 AM

Quote from: FORTRANS on February 09, 2013, 07:34:08 AM
Hi,

Should have said shift rather than rotate above.

Well I put some code in to see how many collisions occur.
The first test got 16088 collisions for 100318 indicies or 16%.
A second test was run to test what I hoped would be an
improvement in the algorithm. The number of collisions was
5683 with the newer version and was 2656 with the posted
version, the number of indicies was 801773, so 0.7% and
0.3%. The major difference is the first test was for a 256
color file and the second was 16 color.

Cheers,

Steve N.

Not bad Steve. :t
In my case, working on text file, I'll probably hit the same
number of collisions than the one with 256 colors, being
256 the number of ANSI codes.
Solving the "collision problem" in a fast way should be the key for
a good hash algo. It is the most "time consuming part" for what
I can foresee.

hutch-- · February 09, 2013, 12:08:57 PM

Frank,

With a decent hash table, the slowest part will always be the number generation, if your collision detection is done properly it does not noticably slow down until you have something like 80% table saturation which you should never reach if you allocate a table big enough to handle your data. Rough rule is 50% saturation will barely see any slow down. Now there are a few tricks to improve the distribution along the table, make the collision stepping dependent on the data length so that each entry is potentially different from the last one, this means you have a better chance of finding an empty slot if you do not keep stepping the same amount each time.

The action will always be in how you hash the source data, the more unique the result, the less collisions you get and one of the most common techniques is to multiply then return the remainder as it will be in the range of your table size. It usually involves getting a number for each character in your source data and there are a number of ways to do this, a lookup table for each character position allows you to use a table of prime numbers so that you do not get even stepping and a reduced number of results.

FORTRANS · February 10, 2013, 03:35:26 AM

Quote from: frktons on February 09, 2013, 10:24:59 AM
Not bad Steve.

Hi Frank,

Well 16% is a bit high. Though livable. Less than 1%
is great. I think the large amount of background color in
that image may be a factor. The 16 color image has a non-
uniform background.

Quote
In my case, working on text file, I'll probably hit the same
number of collisions than the one with 256 colors, being
256 the number of ANSI codes.

Actually text is skewed to fewer symbols. English or programs
has letters, numbers, punctuation, white space, and a few
symbols. And the letters are skewed towards things like vowels
and lower case , and away from capital letters, q's, and z's
So maybe a hundred total and thirty common. Just looked
at a large assembly program of mine, and it has about a third
comprised of just spaces, CR, and LF. (I don't use tabs.)
It does use all 256 codes though.

Quote
Solving the "collision problem" in a fast way should be the key for
a good hash algo. It is the most "time consuming part" for what
I can foresee.

Two comparisons and two conditional jumps in this case to
determine if a collision occurs. And then make a change in the
index and check again. Almost nothing compared to the bit
mashing and record keeping to create the variable length LZW
tokens and putting them in a file.

Hutch, good comments. Knuth used 80% as a saturation
as that is where his programs started to be adversely affected.
And he wrote the book (at least the edition I have) when
computers were much smaller, so space often was often a
trade off with speed.

I may do some further tests...

Regards,

Steve N.

frktons · February 10, 2013, 09:30:16 AM

Thanks Steve I and II.

I'm studying a little of this stuff before choosing what kind of algorithm
to implement. Better ho have a clear idea of what to do, otherwise I'm
not able to start coding.

I'll post my ideas as they becomes clean and organized.

Frank

FORTRANS · February 11, 2013, 08:17:38 AM

Hi,

Ran some tests creating 256 color GIF files. These were
four or five times larger than the one I mentioned above.
Most had 10 to 12 % collisions. But one interesting data
point was an upside down picture had 21% collisions.
When I flipped the image over, it had only 11%. Sheesh.

Cheers,

Steve N.

frktons · February 11, 2013, 08:29:54 AM

Quote from: FORTRANS on February 11, 2013, 08:17:38 AM
Hi,

Ran some tests creating 256 color GIF files. These were
four or five times larger than the one I mentioned above.
Most had 10 to 12 % collisions. But one interesting data
point was an upside down picture had 21% collisions.
When I flipped the image over, it had only 11%. Sheesh.

Cheers,

Steve N.

This looks a bit odd. I don't know what this could mean...
By the way, are you still using the 16 bit code, or you
updated it to 32 bit?

Frank

FORTRANS · February 11, 2013, 09:38:12 AM

Hi Frank,

Quote from: frktons on February 11, 2013, 08:29:54 AM
This looks a bit odd. I don't know what this could mean...

If you mean 21% to 10% collisions, it means that the LZW
algorithm combined with my hash algorithm is quite data
dependent. Some patterns are worse than others. I mentioned
this to see if anyone else has seen something similar.

Quote
By the way, are you still using the 16 bit code, or you
updated it to 32 bit?

No, it is still 16-bit. The arithmetic fits into word and byte
sized variables. The only 32-bit item in the code is a DWORD
(really two words) to count the indicies to report the collision
information. And the code is ugly. If I upgraded to 32-bit
code I sould probably just start over. The executable is about
33.5 kb, of which 25.6 kb is the data for the hash table. And
it runs on my 80186 computer if needed. Hm, haven't tried
that recently.

I have written some 32-bit code. Some as subroutines for
16-bit programs. Some as modifications to programs that
people have posted here in the forum.

Cheers,

Steve N.

hutch-- · February 11, 2013, 10:01:18 AM

Frank,

In the example code in MASM32 there is a hash table example for performing word replacements on text data. It loads a word list into the table then replaces each word in the source that is in the loaded table with its replacement in the loaded table. This example is one of the simpler ones that uses fixed memory and it does the basic things that a hash table needs to do.

\masm32\examples\advanced\wrep

frktons · February 11, 2013, 10:28:26 AM

Quote from: FORTRANS on February 11, 2013, 09:38:12 AM
If you mean 21% to 10% collisions, it means that the LZW
algorithm combined with my hash algorithm is quite data
dependent. Some patterns are worse than others. I mentioned
this to see if anyone else has seen something similar.

... If I upgraded to 32-bit
code I sould probably just start over. The executable is about
33.5 kb, of which 25.6 kb is the data for the hash table. And
it runs on my 80186 computer if needed. Hm, haven't tried
that recently.

Steve N.

Ok Steve, it was just a curiosity.

Quote from: hutch-- on February 11, 2013, 10:01:18 AM
Frank,

In the example code in MASM32 there is a hash table example for performing word replacements on text data. It loads a word list into the table then replaces each word in the source that is in the loaded table with its replacement in the loaded table. This example is one of the simpler ones that uses fixed memory and it does the basic things that a hash table needs to do.

\masm32\examples\advanced\wrep

Thanks Steve, I'll have a look at it. :t

The MASM Forum

News:

The shadow of a binary tree

FORTRANS

frktons

FORTRANS

frktons

FORTRANS

FORTRANS

frktons

hutch--

FORTRANS

frktons

FORTRANS

frktons

FORTRANS

hutch--

frktons