The BTREE Module: Balancing a Binary Tree

[Home] [ Next Page ]
[ References ]

[BTREE Module Source File: btree.c. ]
[BTREE Module Source File: btree_statistics.c. ]
[BTREE Module Source File: btree_balance.c. ]
[BTREE Module Public Header File: btree.h. ]
[BTREE Module Private Header File: btreep.h. ]

This discussion concerns adding a method to our BTREE module to balance a binary tree. The algorithm I will use is from your textbook, Data Structures and Program Design in C, Second Edition by Kruse, Tondo and Leung (Prentice Hall, New Jersey, 1997). The authors describe the algorithm in pretty good detail in Section 9.3, "Building a Binary Search Tree," so I will not attempt to describe the algorithm here beyond discussing the idiosyncrasies of my specific adaptation.

Since balancing a binary tree pretty much presupposes that we will be using the module to build a binary search tree, it makes sense that we also add a method to delete a node from the tree. The algorithm for doing so may be found in your textbook in Section 9.2.5, "Deletion from a Binary Search Tree." In addition, we will want to test and evaluate our implementation, so I will discuss a test driver for doing so; to facilitate this process we will incorporate into the module a facility for validating and analyzing a tree. Since part of the driver will require generating large numbers of random keys and timing certain operations, we will also need a random number generator and a timing tool, which I will describe in the appendices to this document.

Table of Contents

  1. Deleting a Node from a Binary Search Tree: The BTREE_delete_node Method
  2. Validating a Binary Tree: The BTREE_validate Method
  3. Analyzing a Binary Tree: The BTREE Module Statistics Facility
  4. Balancing a Binary Tree: The BTREE_balance Method
  5. Testing the Implementation
  6. Appendix A: The Timer Module
  7. Appendix B: The Rand Module
  8. Reference Pages

Deleting a Node from a Binary Search Tree: The BTREE_delete_node Method

The algorithm for deleting a node from a Binary Search Tree is given in your textbook, in Section 9.2.5 "Deletion from A Binary Search Tree." To summarize, the algorithm follows the this procedure:

[Note: In the following discussion D refers to the node to be deleted; L refers to the left child of the node to be deleted; R refers to the right child of the node to be deleted; and P refers to the parent of the node to be deleted.]

  1. If D is a leaf, we need simply remove it from the tree.
  2. If D has only a right or left child, D is removed from the tree, and its child is attached to the branch of P vacated by D.
  3. If D has both left and right children, D is removed from the tree, R is attached to the vacated branch of P, and a home is found for L by recursively descending the left subtree of R until we find an empty left branch (see your textbook for a more detailed explanation).
  4. Boundary conditions: special logic is required if D is the root node.

The code provided by the authors of your textbook consists of a subroutine to which is passed the address of a pointer to the node to be deleted; for example:

/* Delete parent's left child */
DeleteNodeTree( &parent->left );

Our implementation, however, will require that the user pass the address of the node to be deleted. Therefore the user will pass the ID of the node to be deleted, and our first job will be to go to the node's parent and find the address of the appropriate pointer:

    DeleteNode( node );
    . . .
void DeleteNode( BTREE__NODE_p_t node )
{
    BTREE__NODE_p_t temp    = NULL;
    BTREE__NODE_p_t parent  = node->parent;
    BTREE__NODE_p_t *hole   = NULL;

    /* Deleting the node is going to leave a hole in the parent node or,
     * if the deleted node is the root, the tree control structure. This
     * logic finds the address of the hole.
     */
    if ( parent == NULL )
    {
        CDA_ASSERT( node == node->tree->root );
        hole = &node->tree->root;
    }
    else if ( node == parent->left )
    {
        hole = &parent->left;
    }
    else
    {
        CDA_ASSERT( node == parent->right );
        hole = &parent->right;
    }
    . . .
}

The remainder of our logic pretty much follows the code in the textbook making allowances for the fact that our BTREE node structure differs from the authors' node structure in several details:

    . . .
    /* If the deleted node is a leaf, fill the hole with NULL; otherwise
     * if the deleted node's right subtree is empty, fill the hole with
     * the node's left subtree; otherwise fill the hole with the node's 
     * right subtree, and look for a new home for the left subtree.
     */
    if ( node->right == NULL )
    {
        *hole = node->left;
        if ( node->left != NULL )
            node->left->parent = parent;
    }
    else if ( node->left == NULL )
    {
        *hole = node->right;
        if ( node->right != NULL )
            node->right->parent = parent;
    }
    else
    {
        for ( temp = node->right ; temp->left != NULL ; temp = temp->left )
            ;
        temp->left = node->left;
        temp->left->parent = temp;

        *hole = node->right;
        node->right->parent = parent;
    }
    . . .
}

One final note: because of an optimization that will be discussed later (see Balancing a Binary Tree: The BTREE_balance Method ) we anticipate that in the future we may provide the user with more than one way to delete a node; to facilitate that, the main body of our delete logic is divorced from the public delete-node entry point:


BTREE_NODE_ID_t BTREE_delete_node( BTREE_NODE_ID_t node )
{
    /* put most of the deletion logic in a separate subroutine
     * because one day we might provide the user with a new method
     * that *recycles* a deleted node instead of destroying it.
     */
    delete_node( node );
    CDA_free( node );

    return BTREE_NULL_NODE_ID;
}

The complete implementation of this method is found in btree.c

Validating a Binary Tree: The BTREE_validate Method

The truth is there's not a whole lot we can do to validate a binary tree. Really, all we can do is to visit every node in the tree and verify that it points upwards to its parent. Still, after we get around to balancing a tree, it will be useful to be able to validate that we correctly re-parented all the tree's nodes.

So we're going to visit each node in the tree using a post-order traversal and make sure that each of the node's children point back to the node; special consideration will have to be made for the root node which doesn't have a parent. Along the way we will need to keep track of the number of the node that we are testing; knowing the number of a node that has an invalid parent will make debugging the error easier. And since we will be executing a recursive traversal of the tree, it will also help to keep track of the overall status of the operation. Which gives us the opportunity to introduce the idea of an operation context.

One way to keep track of the state of a recursive operation is via global variables. For example:

static CDA_BOOL_t   validation_result = CDA_TRUE;
static CDA_UINT32_t node_num          = 0;

static CDA_BOOL_t validate( BTREE__NODE_p_t node )
{
    . . .
    ++node_num;
    . . .
    if ( node->left != NULL )
        if ( node->left->parent != node )
            validation_result = CDA_FALSE;
    . . .
}

This approach, however, has a couple of problems, most notably it is not reentrant. And just what do we mean by that?

Suppose your application is composed of two threads of execution (that is, your application is divided into two pieces, each of which can execute independently). Further suppose that each thread is operating on a different binary tree. Now, say thread 1 has started a validation operation on its btree. Then thread 2 comes along, interrupts thread 1, and begins a validation operation on its binary tree. Well, both validation operations are going to be using the same global variables to keep track of their states, which means that at least one of the operations is going to get its state corrupted. [Note: A full understanding of the above argument is not crucial to understanding the rest of this discussion.]

One solution to this problem is to introduce a context in place of the global variables. This will be a struct unique to each operation that contains fields to track the operation's state. A reference to the struct will have to be stored somewhere that it can be associated with a specific tree; within our BTREE module, that would be somewhere in the tree's control structure. Later we will find that there are other operations that would benefit from a context; so we will add to the BTREE control structure a void* field to point to anything that may be appropriate to a particular operation. Our new control structure will look like this:

typedef struct btree__control_s
{
    BTREE__NODE_p_t root;
    void            *context;
}  BTREE__CONTROL_t;

And we will use it something like this:

typedef struct vcontext_s
{
    CDA_UINT32_t    node_num;
    CDA_BOOL_t      result;
} VCONTEXT_t, *VCONTEXT_p_t;
. . .

CDA_BOOL_t BTREE_validate( BTREE_ID_t tree )
{
    CDA_BOOL_t  rcode   = CDA_TRUE;
    VCONTEXT_t  ctx;

    ctx.node_num = 0;
    ctx.result = CDA_TRUE;
    tree->context = &ctx;

    validate( tree->root ); /* begin tree traversal */
    rcode = ctx.result;

    tree->context = NULL;
    return rcode;
}
. . .
static CDA_BOOL_t validate( BTREE_NODE_ID_t node )
{
    CDA_BOOL_t  rcode   = CDA_TRUE;

    if ( node != NULL )
    {
        VCONTEXT_p_t    ctx = node->tree->context;
        . . .
        ++ctx->node_num;
        . . .

        if ( !rcode )
            ctx->result = rcode;
    }
    . . .
}

The complete implementation of this method is found in btree_statistics.c. Anytime the method discovers an invalid node it will print an explanatory note to stderr.

Analyzing a Binary Tree: The BTREE Module Statistics Facility

When analyzing a binary tree mainly what we want to know is: 1) how many nodes does the tree contain; 2) is the tree balanced; and 3) if the tree isn't balanced, how far out of balance is it?

Answering question 1 is fairly easy. Answering questions 2 and 3, particularly 3, is a little more complicated. In order to answer question 2 we need to know the length of the longest path from root to leaf, and the length of the shortest path from root to leaf; if these values differ by more than 1 the tree is unbalanced.

Answering question 3 in detail requires looking ahead a bit in your textbook to the section on balancing a tree. The short answer is: we need to know now many leaf nodes there are, and the average length of a path from root to node. How to use these values is a subject of a later discussion; how to obtain them is what we're concerned with here. Specifically, we propose adding to the BTREE module a facility which consists of the following methods:

BTREE_compute_stats
This method will compute a variety of statistics; each statistic may be obtained by calling one of the methods discussed below. Note that the result returned by any of the statistics interrogation methods is undefined if BTREE_compute_stats is not called first. Any time the tree is changed BTREE_compute_stats must be called again in order to update the statistics.

BTREE_get_num_nodes
This method will return the number of nodes resident in the tree as of the last time BTREE_compute_stats was called.

BTREE_get_num_leaves
This method will return the number of leaves resident in the tree as of the last time BTREE_compute_stats was called.

BTREE_get_shortest_path
This method will return the length of the shortest path from root node to leaf as of the last time BTREE_compute_stats was called.

BTREE_get_longest_path
This method will return the length of the longest path from root node to leaf as of the last time BTREE_compute_stats was called.

BTREE_get_avg_path
This method will return the average length all paths from root node to leaf as of the last time BTREE_compute_stats was called.

BTREE_get_optimum_leaves
This method will return the optimum number of leaves that would be resident in the tree given the total number of nodes as of the last time BTREE_compute_stats was called. Looking ahead in your textbook to the discussion of creating a balanced tree, you will see that this is always half the total number of nodes in the tree.

BTREE_get_optimum_path
This method will return the optimum path length from root to node given the total number of nodes as of the last time BTREE_compute_stats was called. This is always the log (base 2) of the total number of nodes in the tree.

BTREE_get_optimum_visits
This method will return the average number of visits to complete a successful search of a binary tree under optimal conditions, given the total number of nodes as of the last time BTREE_compute_stats was called. This figure is computed by determining the number of levels in the tree (which is the same as the optimum path length) and calculating the number of nodes that are present at each level. For example, a tree with 15 nodes will have 4 levels, with 8 nodes in the fourth (leaf) level, 4 nodes in the third level, 2 nodes in the second level and one node in the first (root) level; therefore 8 searches will require visiting 4 nodes, 4 searches will require visiting 3 nodes, 2 searches will require visiting 2 and 1 search will need to visit just one node. Overall, under optimal circumstances, a tree with fifteen nodes will satisfy a successful search making an average of ((8 * 4) + (4 * 3) + (2 * 2) + (1 * 1)) / 15 equals (approximately) 3.267 visits.

Our implementation strategy for BTREE_compute_stats will be as follows:

The complete implementation of this facility is found in btree_statistics.c

Balancing a Binary Tree: The BTREE_balance Method

The algorithm that we will use to balance a tree is found in your textbook in Section 9.3, "Building a Binary Search Tree." I will not try to explain the algorithm or it's implementation, except where our specific code departs from the authors' (or, in one case, where there's a bug in the authors' code).

The first thing you should note about the algorithm is, as the authors explain, that it does not always create a perfectly balanced tree. The second thing is that the authors' discuss building a balanced tree from scratch, using an anonymous source that provides tree nodes already in an ordered sequence. Since we are starting with an existing tree, we will get our ordered sequence of nodes by simply doing an in-order traversal of the tree:

/* pseudocode for in-order traversal; details omitted */
traverse( BTREE__NODE_p_t node )
{
    if ( node != NULL )
    {
        traverse( node->left )
        insert_node( node )
        traverse( node->right )
    }
}

To keep track of the node count and lastnode array of the authors' algorithm I chose to use a context like the one introduced above, in Validating a Binary Tree: The BTREE_validate Method ; for the size of the lastnode array I arbitrarily chose a size of 30, giving us the ability to balance a tree with over one billion nodes (I'm unhappy with this arbitrary selection, as I always am with arbitrary choices; we should probably count the number of nodes in the tree, and create an array of an appropriate size, but, at least at the time of original implementation, I was also unhappy with adding that much overhead to the algorithm):

typedef struct context_s
{
    int                 count;
    BTREE__NODE_p_t     nodes[30];
} CONTEXT_t, *CONTEXT_p_t;

BTREE_ID_t BTREE_balance( BTREE_ID_t tree )
{
    int         inx     = 0;
    CONTEXT_t   context;

    context.count = 0;
    for ( inx = 0 ; inx < (int)CDA_CARD( context.nodes ) ; ++inx )
        context.nodes[inx] = NULL;
    tree->context = &context;
    traverse( tree->root );
    . . .
}

Our adaptation of the authors' algorithm will require us to: a) use the in-order traversal to find the next node to insert; b) create a new node, copy the data out of the old node, and insert the new node into the balanced structure; and c) dispose of the old node when it is no longer needed. "Disposing" of the old node could well mean simply destroying it; that, however, would introduce an enormous amount of churn into the heap management routines (malloc and free), and decrease the efficiency of the algorithm greatly. So we will introduce the concept of "recycling" a node; a node that is no longer needed will be placed in a used-node list, and, when a new node is needed we will try to find one in the used-node list before allocating a brand new one.

The Recycling Facility

Implementing the recycling facility will require the following:

  1. One final modification to the BTREE control structure, to add a field to serve as head of the used-node list:
    /* "unused" will be the head of the used-node list */
    typedef struct btree__control_s
    {
        BTREE__NODE_p_t root;
        BTREE__NODE_p_t unused;
        CDA_UINT32_t    num_nodes;
        CDA_UINT32_t    num_leaves;
        CDA_UINT32_t    shortest_path;
        CDA_UINT32_t    longest_path;
        double          avg_path;
        void            *context;
    }  BTREE__CONTROL_t;
  2. The addition of a private method, BTREE__recycle_node, to recycle a subtree instead of destroying it. The new method needs to be private (as opposed to local) because it needs to be accessed from both btree.c and btree_balance.c (note that a prototype will need to be added to btreep.h). It will put the nodes from the deleted subtree into a singly linked list anchored by the unused field of the control structure, and utilizing a node's right field as the link to the following node in the list:
    void BTREE__recycle_node( BTREE__NODE_p_t node )
    {
        BTREE__CONTROL_p_t  tree    = node->tree;
    
        if( node->left != NULL )
            BTREE__recycle_node( node->left );
    
        if( node->right != NULL )
            BTREE__recycle_node( node->right );
    
        remove_node( node );
        node->right = tree->unused;
        tree->unused = node;
    }
  3. The addition of a private method, BTREE__new_node, that will obtain and initialize a new node. If the used-node list is nonempty the new node will come from the from the used-node list, otherwise a new node will be allocated from the heap:
    BTREE__NODE_p_t BTREE__new_node( BTREE__CONTROL_p_t tree, void *data )
    {
        BTREE__NODE_p_t node    = tree->unused;
    
        if ( node != NULL )
            tree->unused = node->right;
        else
            node = CDA_NEW( BTREE__NODE_t );
                                         
        node->data = data;
        node->tree = tree;
        node->parent = NULL;
        node->left = NULL;
        node->right = NULL;
    
        return node;
    }
  4. Other minor changes to the existing implementation need to be made, specifically: any method that formerly manufactured a new node from the heap, for example BTREE_add_left, will now call BTREE__new_node to obtain the node; and administrative details common to both BTREE__recycle_node and BTREE_destroy_subtree will be removed to a local subroutine, remove_node.

The complete set of revisions needed to implement the recycling facility may be found in btreep.h and btree.c.

The Balancing Logic

The code for balancing a tree will be placed in the file btree_balanced.c. Our adaptation of the authors' BuildTree subroutine is contained in BTREE_balance and the subroutine traverse:

BTREE_ID_t BTREE_balance( BTREE_ID_t tree )
{
    int         inx     = 0;
    CONTEXT_t   context;

    context.count = 0;
    for ( inx = 0 ; inx < (int)CDA_CARD( context.nodes ) ; ++inx )
        context.nodes[inx] = NULL;
    tree->context = &context;
    traverse( tree->root );
    CDA_ASSERT( tree->root == NULL );

    connect_subtrees( tree );

    tree->context = NULL;

    return tree;
}

static void traverse( BTREE__NODE_p_t node )
{
    if ( node != NULL )
    {
        traverse( node->left ); 
        insert_node( node );                   
        traverse( node->right );
        CDA_ASSERT( BTREE_is_leaf( node ) );
        BTREE__recycle_node( node );
    }
}

Our version of the insert subroutine will depart from the authors' mainly in the details we require to maintain the parent field of a node:

static void insert_node( BTREE__NODE_p_t node )
{
    CONTEXT_p_t     context = node->tree->context;
    BTREE__NODE_p_t *nodes  = context->nodes;
    int             level   = get_level( ++context->count );
    BTREE__NODE_p_t next    = BTREE__new_node( node->tree, node->data );

    next->left = nodes[level - 1];
    next->right = NULL;
    next->data = node->data;
    next->tree = node->tree;
    next->parent = NULL;
    nodes[level] = next;

    if ( next->left != NULL )
        next->left->parent = next;

    if ( nodes[level + 1] != NULL )
    {
        BTREE__NODE_p_t temp = nodes[level + 1];
        if ( temp->right == NULL )
        {
            temp->right = next;
            next->parent = temp;
        }
    }
}

Finally, we will combine the authors' subroutines FindRoot and ConnectSubtrees into a single subroutine. As you can see from the code, below, we will make a few minor changes to the logic. The first (see boldface text) is required in order to properly identify the root of the tree (in the authors' code this is done via an independent subroutine), and the second (see boldface text) is required to fix a bug in the connection logic. Additional changes are required to maintain our node's parent field:

static void connect_subtrees( BTREE__CONTROL_p_t tree )
{
    CONTEXT_p_t     context     = tree->context;
    BTREE__NODE_p_t *nodes      = context->nodes;
    BTREE__NODE_p_t node        = NULL;
    int             inx         = 0;

    for ( inx = CDA_CARD( context->nodes ) - 1  ; 
          inx > 0 && nodes[inx] == NULL         ; /* Kruse: inx > 2 */
          --inx
        )
        ;

    tree->root = nodes[inx];
    while ( inx > 2 )
    {
        if ( nodes[inx]->right != NULL )
        {
            --inx;
        }
        else
        {
            int jnx = inx - 1;
            node = nodes[inx]->left;
            do
            {
                node = node->right;
            } while  ( node != NULL && node == nodes[--jnx] );

            /* Kruse: if ( jnx > 1 ) */
            if ( jnx > 0 & & nodes[jnx]->parent == NULL )
            {
                nodes[inx]->right = nodes[jnx];
                nodes[jnx]->parent = nodes[inx];
            }
            --inx; /* Kruse: inx = jnx; */
        }
    }
}

The complete details of the implementation may be found in btree_balance.c.

Next: Testing the Implementation.


Copyright © 2005 by Jack Straub
jstraub@centurytel.net