Wednesday 9 May 2012

Using arrays for better analysis of nodes in a tree

Prologue

Let's start with the basics..
This post is written with the pre-supposition that the readers know about the theoretical concept of trees, storage of nodes of trees using lists and Depth-first search (DFS) used for analyzing those nodes...
[ If you are not aware of this terms, I suggest you read this book on Algorithms 
Also, keep Wikipedia handy if it appears a bit advanced...
Even if you know about these, still it's a good read..
As far as I can remember, it was allowed to be freely distributed; so you shouldn't hesitate downloading it. 
For quick learning, you may start with 3rd chapter named " Decompositions of graphs" and after completing page 95, I think you'll get the context...]

Context

As we all know,  "tree" is a theoretical concept in Computer Science and hence, 
data structures like lists or matrices are used for storing these trees.
Now, all that is fine, but how can you find the relation ( like parent, child, descendant) between two nodes?
For eg. consider this -> 
A company has about 500 webpages in their website.
The webpages are inter-connected among each other, 
but without any looping, for easy browsing of the reader.
Consider that they have maintained a site map as a list.
Now, someone needs to find (separately for each) if 100 of those pages (Group.A) links to another 100 pages (Group B).
So, using a normal DFS search, it'll take 100x100 searches which is not all feasible.
Even an optimized DFS search, which checks the relation between one webpage of say Group.A, with all the webpages in Group.B takes 100 DFS search and each of them needs to loop through many of the 5oo pages to find the required relation..

Can the process be optimized further?

Of course, we can...
Actually, we don't need that much of DFS search (or, just DFS, whatever)...
Whenever  DFS is used a good number of times times,it is bound to search through the same nodes which has been already searched in a previous DFS.
This optimization, on the other hand, increases the speed of the whole process, by sacrificing some storage space.

Ok, just show me how to optimize it... (The main concept)

What I plan is to run the DFS through the whole list  (i.e. all the nodes) only once..
And use a 2-D arrays (with the number of elements being equal to number of nodes) to speed up the later searches...

The DFS search is to be customized such that for every level, when the explore() function :gets called
1. The node address is used to fill up the positions of the first column array sequentially
2. The corresponding second column (of the same row) is filled up with an integer which is decided by the
    following rule:
  • If it is the source, then the number used is 1.
  • For every next level of the tree, zero (0) is apprehended to the end of the number that its immediate parent/ancestor was given.
  • For every node, a number is added (in the end of the number previously obtained), which is equal to the sequential order of that node in that level in which it is accessed, taking only those nodes in account which have the same parent (not ancestor).
For eg, see the below  illustration:
1. The head node is given 1.
 
2. When the next level is accessed, a zero is added. So, it becomes 10.
 
3. Now, for the first accessed node in the second level, 1 is added at the end of 10; for the second accessed node in the second level, 2 is added at the end of 10;  and so on...
 
4. Now for the node in the third level which is the child of the node marked "102", at first zero is added to signify the next level and then 1 is added to show that it is the only child of "102" and hence it the child 
    which is accessed first... So, the number becomes "10201"
 
This continues similarly...
   


How is it gonna help me?

Now that the variable is assigned to each node, let's see how this numbers are gonna help us:

Note: This method is applicable only when the number of child of any parent is less than 100, which is acceptable for general purposes... Also, here acceptable zero refers to that zero which doesn't have zero after it and also isn't the last digit of the number... For. eg, if the head node had 40 sub-node, then the 1st child of the 30th sub-node will have a number "103001" : Here the coloured zero is not acceptable zero by definition, which actually signifies 0 of the 30th node..

1. The number with higher number of acceptable zero is of a higher level.
2. The numbers of same number of acceptable zero are of same level.
3. If between two numbers, all  the digits of the smaller number are same as the corresponding digits of the 
    larger number and 
    a) the larger number has no more acceptable zeros than the smaller one ->
        then the smaller number is the parent of the larger number.
    b) the larger number has more acceptable zeros than the smaller one ->
       then the smaller number is the ancestor of the larger one.

Thus, we can get all this information which are required for any type of relation between two nodes, without running the DFS multiple times, and using only one variable per node.

The End

Actually, I was reading that book called Algorithms (I mentioned at the Prologue) and I thought that this topic will bring a change of taste and so, here it is...
If you have any suggestions or want to have a small debate, feel free to comment.....

No comments:

Post a Comment