The shadow of a binary tree

Started by frktons, February 03, 2013, 06:54:01 PM


frktons

I have never had occasion to build a binary tree, and my knowledge of
the subject is practically non-existent.
I only know that they have a structure with a root and branches on the left
for items with smaller values, and on the right for bigger values.
Each item has its own value [a number, a string ...] and two pointers,
one for a lesser value and one for a bigger one.
How to declare them, how to create them, insert items, delete them, move
them from one position to another, and all the operations involved, including
searching and balancing the tree, are still a mystery.
Considering they could be useful, if anybody has some experience with
them, please give me your opinion on when they are useful,
and how to translate the theory into ASM code to create, for example, a
binary tree starting from an unsorted array of random numbers/strings.
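
Just to show the little I have understood so far: in an HLL like C, I imagine the node and the insert rule would look something like this [names and details invented by me, and quite possibly wrong]:

    #include <stdlib.h>

    /* One item of the tree: its own value plus two pointers,
       one for a lesser value and one for a bigger one. */
    typedef struct Node {
        int value;             /* the item [a number, or a pointer to a string] */
        struct Node *left;     /* branch with the smaller values */
        struct Node *right;    /* branch with the bigger values  */
    } Node;

    /* Walk down from the root, going left or right, until a free link is found. */
    void insert(Node **link, int v)
    {
        while (*link != NULL)
            link = (v < (*link)->value) ? &(*link)->left : &(*link)->right;
        *link = malloc(sizeof(Node));          /* no error checking here */
        (*link)->value = v;
        (*link)->left = (*link)->right = NULL;
    }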

Thanks for any help you can give.

Frank


 
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

qWord

hello,
the English Wikipedia has good articles on the different kinds of trees - there you can read up on what they are commonly used for. Even if you do not fully understand them at the moment, it should be no problem for you to translate existing HLL code to ASM. As an example, I use red-black trees for huge (sorted) data sets with a high search rate and few insertions/deletions. One advantage of them is that you can quickly step through the data in order (or in reverse). In the context of search trees, you may also take a look at hash tables.
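
For example, stepping through the data in ascending order is just an in-order walk. A rough sketch in C, assuming a node layout like the one described above (a value plus left/right pointers):

    #include <stdio.h>

    /* In-order walk: left subtree, then the node, then the right subtree
       gives the values in ascending order; swap the two calls for descending. */
    void walk(const Node *n)
    {
        if (n == NULL)
            return;
        walk(n->left);
        printf("%d\n", n->value);
        walk(n->right);
    }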
MREAL macros - when you need floating point arithmetic while assembling!

frktons

Thanks qWord.

The few things I know about binary trees were collected from Wikipedia
and the net. Maybe if I detail the task I need to perform, it will be clearer
whether binary trees are the best option or not.

In my slow progress on the text compression task, I've reached the point
where I have to scan the input text file, map the words inside the file, count
the occurrences of each word, and do something else to prepare the data for the
next steps.

I thought about arrays first, but they are probably not the best solution. So after some
research I came across binary trees, and after thinking for a while about their possible
use, I believed they could be the solution for building the word
dictionary quickly, scanning the input file only once and doing all the related
operations at the same time: counting, selecting, matching, connecting with other
special tokens, and so on.
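
In rough C-like terms, just to fix the idea [add_word is invented here, standing for whatever dictionary structure turns out to be best], the single pass I have in mind would be something like:

    #include <ctype.h>
    #include <string.h>

    void add_word(const char *word);   /* hypothetical: count/insert in the dictionary */

    /* One pass over the input buffer: isolate each word and hand it
       to the dictionary for counting. */
    void scan_buffer(const char *buf, size_t len)
    {
        char word[64];
        size_t i = 0;
        while (i < len) {
            while (i < len && !isalpha((unsigned char)buf[i]))
                i++;                               /* skip separators */
            size_t start = i;
            while (i < len && isalpha((unsigned char)buf[i]))
                i++;                               /* span the word   */
            if (i > start && i - start < sizeof(word)) {
                memcpy(word, buf + start, i - start);
                word[i - start] = '\0';
                add_word(word);
            }
        }
    }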

Being new to both ASM and binary trees, I have only a vague idea of the
sequence of operations needed to define the data structure for a binary tree and to
use it while feeding it the words found in the input text file.

I'll try to write some sample code and then see how to make it better and consistent
with the specifications.

FORTRANS

Hi,

   For LZW compression a hash table is commonly used.
Some people use a tree to do the bookkeeping when they
don't like hashing.  When I wrote my GIF creator, using its
modified LZW compression, I used a simple hash function.
Sort of quick and easy.

   In "The Data Compression Book", LZSS compression
uses a tree to hold the information needed.  So that might
be a resource for you.

Regards,

Steve N.

frktons

Quote from: FORTRANS on February 05, 2013, 12:31:47 AM
Hi,

   For LZW compression a hash table is commonly used.
Some people use a tree to do the bookkeeping when they
don't like hashing.  When I wrote my GIF creator, using its
modified LZW compression, I used a simple hash function.
Sort of quick and easy.

   In "The Data Compression Book", LZSS compression
uses a tree to hold the information needed.  So that might
be a resource for you.

Regards,

Steve N.

Thanks Steve.

Being new to both hash tables and binary trees, I'm not able to
grasp the advantages/disadvantages of the two different approaches.

I need some more info before I can decide what's best suited.

Frank

Tedd

A hash table is better suited :P

The performance of a (plain) binary tree is highly dependent on the order in which you insert items, and it can easily devolve into a linear linked list. For item lookup and frequency counting, a hash table works in fairly constant time for all items, though it can require a little more memory than a binary tree of the same items.
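
As a rough C sketch of such a table (the size and the hash function are only placeholders), each word costs one hash computation plus a short chain walk, regardless of the order the words arrive in:

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 65521                 /* a prime; scale to the expected word count */

    typedef struct Entry {
        char *word;
        unsigned count;
        struct Entry *next;                /* chain of words whose hashes collide */
    } Entry;

    static Entry *table[NBUCKETS];

    static unsigned hash(const char *s)    /* djb2-style; simple, not the only choice */
    {
        unsigned h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* Count one occurrence of word: find it in its chain or add a new entry. */
    void add_word(const char *word)
    {
        unsigned h = hash(word);
        for (Entry *e = table[h]; e != NULL; e = e->next)
            if (strcmp(e->word, word) == 0) { e->count++; return; }
        Entry *e = malloc(sizeof(Entry));
        e->word = strdup(word);            /* strdup is POSIX; copy by hand otherwise */
        e->count = 1;
        e->next = table[h];
        table[h] = e;
    }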
Potato2

frktons

Quote from: Tedd on February 05, 2013, 02:45:21 AM
A hash table is better suited :P

The performance of a (plain) binary tree is highly dependent on the order in which you insert items, and it can easily devolve into a linear linked list. For item lookup and frequency counting, a hash table works in fairly constant time for all items, though it can require a little more memory than a binary tree of the same items.


Thanks Tedd. Good info to think about. Instead of studying binary trees,
maybe it is better to study hash management  :biggrin:

hutch--

A hash table is a good solution if you have some idea of how many items you wish to handle. A tree can extend to the limit of memory, whereas a hash table tends to be limited to its original count. With a hash table you need to design a decent method of collision detection, but if you get it right you can saturate the table to about 80% before it starts to slow down.

If you can get the mechanics right, the next trick is to choose between a fixed allocation of memory for each slot and dynamic allocation: the former is far faster, but at the cost of inflexibility and high memory usage; the latter is very flexible but slower. With the latter you can make the table much bigger, and unless a slot is used it consumes only a very small amount of memory.
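
In C terms the two designs would look something like this (the sizes are only examples):

    #include <stdlib.h>
    #include <string.h>

    #define SLOTS     131072
    #define SLOT_SIZE 64                  /* fixed design: maximum data length per slot */

    static char *slot[SLOTS];             /* the table itself is just an array of pointers */

    /* Fixed allocation: one big block up front, pointers set at even spacings.
       Fast (nothing to allocate at insert time) but every slot is paid for. */
    void init_fixed(void)
    {
        char *block = malloc((size_t)SLOTS * SLOT_SIZE);
        for (size_t i = 0; i < SLOTS; i++)
            slot[i] = block + i * SLOT_SIZE;
    }

    /* Dynamic allocation: slots start as NULL and cost memory only when used,
       and the data can be any length - the price is one allocation per item. */
    void store_dynamic(size_t i, const char *data)
    {
        slot[i] = strdup(data);           /* strdup is POSIX */
    }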

FORTRANS

Quote from: frktons on February 05, 2013, 12:52:31 AM
I need some more info before I can decide what's best suited.

Hi,

   "The Data Compression Book", by Mark Nelson has a
decent discussion of trees and a little on hash tables.  For
my program I looked at the source for a version of Unix
compress, and some papers fom internet sources.  One
was  LZW and GIF explained by Steve Blackstock.  Another
was  LZW compression used to encode/decode a GIF file by
Bob Montgomery

   For a more thorough discussion, I used "The Art of Computer
Programming", Volume 3/ Searching and Sorting, by Donald
Knuth.  My hash function used a little bit from this (although
misused might also apply).

Regards,

Steve N.

frktons

Quote from: hutch-- on February 06, 2013, 04:28:32 PM
A hash table is a good solution if you have some idea of how many items you wish to handle. A tree can extend to the limit of memory, whereas a hash table tends to be limited to its original count. With a hash table you need to design a decent method of collision detection, but if you get it right you can saturate the table to about 80% before it starts to slow down.

If you can get the mechanics right, the next trick is to choose between a fixed allocation of memory for each slot and dynamic allocation: the former is far faster, but at the cost of inflexibility and high memory usage; the latter is very flexible but slower. With the latter you can make the table much bigger, and unless a slot is used it consumes only a very small amount of memory.

The hash table should have about 64K entries, but that varies depending on
the input text file size. The actual memory in the PC is big enough for the task, I guess.

Quote from: FORTRANS on February 07, 2013, 12:36:41 AM
Quote from: frktons on February 05, 2013, 12:52:31 AM
I need some more info before I can decide what's best suited.

Hi,

   "The Data Compression Book", by Mark Nelson has a
decent discussion of trees and a little on hash tables.  For
my program I looked at the source for a version of Unix
compress, and some papers fom internet sources.  One
was  LZW and GIF explained by Steve Blackstock.  Another
was  LZW compression used to encode/decode a GIF file by
Bob Montgomery

   For a more thorough discussion, I used "The Art of Computer
Programming", Volume 3/ Searching and Sorting, by Donald
Knuth.  My hash function used a little bit from this (although
misused might also apply).

Regards,

Steve N.

Hi.

Well, I think the speed factor is the most important one.
So besides collisions on hash generation, it depends on how long
it takes to generate hashes on the fly while I'm extracting words
from a text file. If I extract 1 million words, I have to generate
1 million hashes, and compare the words extracted with the words
pointed to by the hashes to check for collisions or matches.

What can be done to have unique hashes for unique words?

Frank

qWord

Quote from: frktons on February 07, 2013, 02:55:29 AM
What can be done to have unique hashes for unique words?

the same kind of question: how can I represent numbers between 0 and 10000000 in 8-bit integers without information loss?

frktons

Quote from: qWord on February 07, 2013, 03:10:11 AM
Quote from: frktons on February 07, 2013, 02:55:29 AM
What can be done to have unique hashes for unique words?

the same kind of question: how can I represent numbers between 0 and 10000000 in 8-bit integers without information loss?

So it is impossible. Then we move on to the next step.  :P
What kind of hash generator can be used to get a reasonable
number of collisions? And if you have some MASM sample code to post,
so much the better.

FORTRANS

Quote from: frktons on February 07, 2013, 03:25:47 AM
What kind of hash generator can be used to get a reasonable
number of collisions? And if you have some MASM sample code to post,
so much the better.

Hi,

   Well, for the LZW code used for *.GIF files, you need to
generate codes or tokens of up to 12 bits.  You emit
an existing token and a character to create the next token.
So I shift the character left four bits and add in the token to
create an index into the hash table.  You then use the index
to check for an empty slot, a filled slot with the character the
same as the one you are processing, or a filled slot with a
different character, which means a collision.  The code to
create the index is pretty simple.  (Mine is 16-bit though.)
The code that uses the index is not simple.

   You basically need to take your data and fold it up somehow
to create a properly sized index into the hash table.  GIFs just
happen to be relatively small, so you can use a simple algorithm.
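
In C the index creation would be something along these lines (the table size is only an example; my real code is 16-bit assembly, so take this as a sketch):

    #define TABLE_SIZE 8192            /* enough room for (char << 4) + a 12-bit token */

    /* Shift the character left four bits and add in the token. */
    unsigned make_index(unsigned token, unsigned char ch)
    {
        return (((unsigned)ch << 4) + token) % TABLE_SIZE;
    }

    /* The probe then finds one of three cases at the indexed slot:
       empty                    -> a free slot for the new entry;
       same token and character -> the string is already in the table;
       anything else            -> a collision, so step on and retry. */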

Regards,

Steve

hutch--

Frank,

> What can be done to have unique hashes for unique words?

The general trade-off with a hash table is to use a much faster hash number generator and handle the collisions efficiently. The alternative is to use a much more complex number generator that is a lot slower but still has to handle a reduced number of collisions.

The next trick is to always set the hash table count to a prime number so that any looping around the table never finds a recurring pattern. A rough rule is that if you need to load a given number of items into the table, double the table size and you will never get saturation.

By design, if you make the basic table an array of pointers and use dynamic allocation for each slot, you end up with a very efficient table that will handle variable-length data with no problems. If, alternatively, you know the maximum length of the data you want to load into the table, you allocate a single block of memory and set your array of pointers to evenly spaced locations within that memory. The latter is much faster, but at the cost of being inflexible.
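
As a sketch of why the prime count matters (C again, numbers only as examples, with the array-of-pointers layout described above): with a prime table size, any probe step from 1 upward cycles through every slot before repeating, so a lookup never gets stuck on a short recurring pattern.

    #include <string.h>

    #define TABLE_SIZE 131071             /* a prime, roughly double the expected load */

    static char *slot[TABLE_SIZE];        /* array of pointers; NULL means a free slot */

    /* Probe for string s starting from its hash h; returns the slot index
       where s is stored or should be stored, or -1 if the table is full. */
    int find_slot(const char *s, unsigned h)
    {
        unsigned step = 1 + h % (TABLE_SIZE - 1);
        unsigned i = h % TABLE_SIZE;
        for (unsigned n = 0; n < TABLE_SIZE; n++) {
            if (slot[i] == NULL)          /* free slot: s is not in the table yet */
                return (int)i;
            if (strcmp(slot[i], s) == 0)  /* s is already stored here */
                return (int)i;
            i = (i + step) % TABLE_SIZE;  /* prime size: this walk visits every slot */
        }
        return -1;                        /* table completely saturated */
    }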

frktons

Quote from: FORTRANS on February 07, 2013, 04:35:53 AM

Hi,

   Well, for the LZW code used for *.GIF files, you need to
generate codes or tokens of up to 12 bits.  You emit
an existing token and a character to create the next token.
So I shift the character left four bits and add in the token to
create an index into the hash table.  You then use the index
to check for an empty slot, a filled slot with the character the
same as the one you are processing, or a filled slot with a
different character, which means a collision.  The code to
create the index is pretty simple.  (Mine is 16-bit though.)
The code that uses the index is not simple.

   You basically need to take your data and fold it up somehow
to create a properly sized index into the hash table.  GIFs just
happen to be relatively small, so you can use a simple algorithm.

Regards,

Steve

It'd be nice to see the code you use for generating hashes.
Being new to the matter, it would help me get started.
;)
Quote from: hutch-- on February 07, 2013, 11:20:41 AM
Frank,


The general trade-off with a hash table is to use a much faster hash number generator and handle the collisions efficiently. The alternative is to use a much more complex number generator that is a lot slower but still has to handle a reduced number of collisions.

The next trick is to always set the hash table count to a prime number so that any looping around the table never finds a recurring pattern. A rough rule is that if you need to load a given number of items into the table, double the table size and you will never get saturation.

By design, if you make the basic table an array of pointers and use dynamic allocation for each slot, you end up with a very efficient table that will handle variable-length data with no problems. If, alternatively, you know the maximum length of the data you want to load into the table, you allocate a single block of memory and set your array of pointers to evenly spaced locations within that memory. The latter is much faster, but at the cost of being inflexible.

Yes Steve, good points to consider.   :t