Tree Sets

The SetMap Application

The examples in this and other documents in this section are all part of the Java Application SetMap. Download the source archive SetMap.zip and install it as a Java Application with Existing Sources. See the Using NetBeans document for details.

Like others, the SetMap project has multiple Main Classes intended to illustrate various independent features. The simplest way to run the various classes is to right-click the file and select Run File either in the file itself or from the listing in the Projects window.

The Java files from the project which are referenced in this document are:
adapter/
  SetAdapter
  NavSetAdapter
util/
  SearchTreeSet
demo/
  TreeDemo1
  TreeDemo2
  SearchTreeTraversals
  SearchTreeRemovals

Sets

The basic notion of a Set is that it is simply a Collection which does not store duplicates. A Collection has no notion of position (like a List) and no notion of first and last (like a Deque). It only expresses aggregate properties with the operations add, remove, contains, size, isEmpty, etc.

There are three Set implementation classes of interest to us; this document focuses on TreeSet. The key ingredient is that a Set contains no duplicate elements. What exactly constitutes a "duplicate" depends on the implementation. If an element is a duplicate, the add operation attempting to add it returns false, and the element is not added.

An ArrayList or a LinkedList could be used to implement a Set. The add(elt) member function would have to scan the list to avoid adding a duplicate. If the list is an unordered ArrayList or a LinkedList, the search for duplicates requires linear, i.e. O(n), time.

Alternatively, we could maintain an ArrayList with elements in sorted order. In this case the duplicate search cost is only O(log(n)) time, but it still takes O(n) data movements to actually add the element.

We will see that we can do much better, making add/remove/contains all work in sub-linear time.

Ordered Set interfaces

A stronger version of the Set<E> interface is the SortedSet interface, which assumes an ordering of the elements. This interface provides functions for computing the first and last elements and for obtaining the range of elements between any two elements.

Even stronger is the NavigableSet interface in which we can locate elements in the set which are "near to" some element. A NavigableSet<E> also allows the removal of the first and last elements.

Simple TreeSet Implementation

The Set implementation of interest in this document is the TreeSet class which implements the NavigableSet interface. Its elements are distributed in a search tree structure of some kind which permits an efficient O(log(n)) search. The simplest form of a search tree is a binary search tree, which is a binary tree that maintains the following ordering property at each node:
data values of all nodes in the left subtree of node n  <  data value at node n  <  data values of all nodes in the right subtree of node n
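
This ordering property can be checked mechanically. Below is a minimal sketch of such a check; the Node class and the isSearchTree name are hypothetical stand-ins, not part of the SetMap project, and the sample tree mirrors the one used throughout this document.

```java
// Sketch: verify the binary search tree ordering property.
// Node is a minimal stand-in for the node class used later in this document.
public class BstCheck {
    static class Node {
        int data;
        Node left, right;
        Node(int data, Node left, Node right) {
            this.data = data; this.left = left; this.right = right;
        }
    }

    // Every node's value must lie strictly between the bounds inherited
    // from its ancestors: the interval (lo, hi) narrows as we descend.
    static boolean isSearchTree(Node n, long lo, long hi) {
        if (n == null) return true;
        if (n.data <= lo || n.data >= hi) return false;
        return isSearchTree(n.left, lo, n.data)
            && isSearchTree(n.right, n.data, hi);
    }

    public static void main(String[] args) {
        // The sample tree from this document, rooted at 20.
        Node tree = new Node(20,
            new Node(15, new Node(5, null, null),
                         new Node(18, new Node(17, null, null), null)),
            new Node(30, new Node(25, new Node(22, null, null),
                                      new Node(27, null, null)), null));
        System.out.println(isSearchTree(tree, Long.MIN_VALUE, Long.MAX_VALUE));  // true
    }
}
```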
Here is an example of a tree with integer values:
When working with trees, you should become familiar with the terminology:
root,
children,
a descendant of a node (a path exists from the node to it),
subtree (a node and all of its descendants),
leaf (a node with no children),
interior node (a non-leaf),
a node's level or depth (the path-distance from the root),
the full tree's height (the maximum depth of any node),
a node's height (the height of the subtree at that node),
a perfect tree (all interior nodes have two children & all leaves at same level)
For example, some facts about the above tree:
node 20, the root, is at level 0, 
         it has height 3, 
         it has two children: nodes 15 and 30
node 25  is at level 2, 
         it has height 1, and is a perfect subtree
node 17  is a leaf at level 3 and has height 0

Comparability

Any SortedSet<E>, whatever the implementation may be, must somehow have the ability to determine an ordering of the elements. This usually means that the element type implements the Comparable<E> interface, implying that the compareTo function is defined:
                    < 0  means e1 is less than e2
e1.compareTo(e2)    = 0  means that e1 equals e2
                    > 0  means that e1 is greater than e2
Alternatively, a Comparator<E> object can be used to define the relation between two elements. A Comparator<E> object, cmp, is one which defines the compare function. Using such an object we can define the relation between two elements:
                    < 0  means e1 is less than e2
cmp.compare(e1,e2)  = 0  means that e1 equals e2
                    > 0  means that e1 is greater than e2

TreeSet operation restrictions

Because of the need for comparability, the TreeSet<E> class does not support adding null; the attempt throws an exception. Disallowing null values is not a restriction across all Set<E> implementations; for example, it is OK to store null into a HashSet<E>. Another restriction is that the remove(Object obj) function will throw a ClassCastException if called with an object which cannot be compared with the element type in use.

Demo Project

Here are the two relevant demo classes:
demo.TreeDemo1  
and
demo.TreeDemo2  
Test-run both of them. The first makes use of a Comparator object defined like this:
Comparator<Integer> cmp = new Comparator<Integer>() {
  public int compare(Integer lhs, Integer rhs) {
    String lhs_str = lhs.toString();
    String rhs_str = rhs.toString();
    return lhs_str.compareTo(rhs_str);
  }
};
The object cmp is created using an anonymous inner class. Comparator is an interface which requires the definition of the function compare. Java creates a class which implements Comparator on the fly and makes cmp an instance of this class. The class is local to the mainSetOps function and is never given a name.
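
Since Java 8, an equivalent Comparator can be written more compactly as a lambda expression. The following is a sketch using the same string-based comparison; the class name LambdaCmp is only for this demo.

```java
import java.util.Comparator;

public class LambdaCmp {
    public static void main(String[] args) {
        // Same comparator as the anonymous class above, written as a lambda:
        // compare Integers by the lexicographic order of their decimal strings.
        Comparator<Integer> cmp =
            (lhs, rhs) -> lhs.toString().compareTo(rhs.toString());

        System.out.println(cmp.compare(9, 10) > 0);   // true: "9" sorts after "10"
        System.out.println(cmp.compare(100, 2) < 0);  // true: "100" sorts before "2"
    }
}
```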

The second demo function, mainSortedSetOps illustrates advanced Set operations used in the SortedSet and NavigableSet interfaces.

TreeSet Implementation

We're going to use our own implementation of the TreeSet called SearchTreeSet. It is created in the same manner as in the Array Lists document.

As with the AListAdapter, LListAdapter, and DequeAdapter we create a vacuous basis of the TreeSet class from which we will derive our user-defined version.

The TreeSet only needs to implement the NavigableSet interface, and so we call our class NavSetAdapter. As a useful variation, we will write both a simpler SetAdapter class and make NavSetAdapter an extension. This will serve several purposes later down the line as well.
adapter.SetAdapter  
and
adapter.NavSetAdapter  
The implementation class is:
util.SearchTreeSet  

Initial test-drive

To verify that SearchTreeSet "works" like TreeSet (to some degree), replace the three occurrences of TreeSet by SearchTreeSet in the TreeDemo1 program:
//    Set<String> str_set = new TreeSet<>();
//    Set<Integer> int_set = new TreeSet<>();
//    Set<Integer> int_set_lex = new TreeSet<>(cmp);
 
    Set<Integer> int_set_lex = new SearchTreeSet<Integer>(cmp);
    Set<String> str_set = new SearchTreeSet<>();
    Set<Integer> int_set = new SearchTreeSet<Integer>();

SearchTreeSet Discussion

Our implementation uses a simple binary tree constructed with nodes defined by this inner class:
private class Node<E> {  // note: this E shadows the outer class's type parameter E
  E data;
  Node<E> left, right;
 
  Node(E data, Node<E> left, Node<E> right) {
    this.data = data;
    this.left = left;
    this.right = right;
  }
}
The data members of our class are these:
private Node<E> root = null;
private int size = 0;
private Comparator<? super E> cmp = null;  // passed into constructor

Comparability

We remarked above that a key feature of a TreeSet is the ability to order the elements. Thus the elements must either be Comparable, or a Comparator must be passed into the constructor. In the former case the internal private data member cmp retains its default null value. In the latter case, with constructor argument my_cmp, we would have
cmp = my_cmp
making cmp a non-null value. In either case, we use a private helper function, myCompare, to do the right thing:
private int myCompare(E lhs, E rhs) {
  if (cmp == null) {
    // no Comparator passed into constructor,
    // assume element type, E, is Comparable
    return ((Comparable)lhs).compareTo(rhs);
  } 
  else {
    // use the Comparator to compare elements of type E
    return cmp.compare(lhs, rhs);
  }
}
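
As a sketch of how the two constructors might set up cmp (the names here are illustrative, and myCompare is left non-private only so the demo can call it; in the real SearchTreeSet it is private):

```java
import java.util.Comparator;

// Sketch of how SearchTreeSet's constructors might initialize cmp.
public class SearchTreeSetSketch<E> {
    private Comparator<? super E> cmp;

    // Default constructor: rely on the elements being Comparable.
    public SearchTreeSetSketch() {
        this.cmp = null;
    }

    // Comparator constructor: ordering is determined by my_cmp.
    public SearchTreeSetSketch(Comparator<? super E> my_cmp) {
        this.cmp = my_cmp;
    }

    @SuppressWarnings("unchecked")
    int myCompare(E lhs, E rhs) {
        if (cmp == null) {
            // no Comparator passed in: assume E is Comparable
            return ((Comparable<? super E>) lhs).compareTo(rhs);
        }
        return cmp.compare(lhs, rhs);
    }

    public static void main(String[] args) {
        SearchTreeSetSketch<String> natural = new SearchTreeSetSketch<>();
        SearchTreeSetSketch<String> byLength =
            new SearchTreeSetSketch<>(Comparator.comparingInt(String::length));
        System.out.println(natural.myCompare("apple", "banana") < 0);  // true
        System.out.println(byLength.myCompare("apple", "kiwi") > 0);   // true
    }
}
```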

The textbook's approach

The textbook's class declaration of the search tree is like this:
public class SearchTreeSet<E extends Comparable<? super E>> {
  ...
}
which implies that its element type must be Comparable! This unnecessarily restricts the element types a search tree can hold, a restriction not made by the java.util.TreeSet class, in which we can initialize a TreeSet with a Comparator for a non-Comparable element type.

The contains function

The contains function generalizes what we would do with a singly-linked list. Keep in mind that the Java API requires the parameter type to be Object even though we should only ever use the E type. We can write it iteratively:
public boolean contains(Object obj) {
  E elt = (E) obj;
  Node<E> n = root;
  while (n != null) {
    int comp = myCompare(elt, n.data);
    if (comp < 0) {
      n = n.left;
    } 
    else if (comp > 0) {
      n = n.right;
    } 
    else {
      return true;
    }
  }
  return false;
}
or recursively:
public boolean contains(Object obj) {
  E elt = (E) obj;
  return contains(root, elt);
}
 
private boolean contains(Node<E> n, E elt) {
  if (n == null) {
    return false;
  }
  int comp = myCompare(elt, n.data);
  if (comp < 0) {
    return contains(n.left, elt);
  } 
  else if (comp > 0) {
    return contains(n.right, elt);
  } 
  else {
    return true;
  }
}

The add function

The add function:
boolean add(E elt) { ... }
must return false if elt is already in the tree, and true if not, in which case it will add the element into a new leaf node. The leaf node being "at the end of a chain" creates the impression that we might employ the strategy of the List addLast function; however, the analogy is not so simple because of the 2-way branching.

We will follow the idea of the recursive add as in the textbook. The basic idea is to employ a private recursive helper function which is called from the public add function like this:
root = add(root, elt);

Reference parameters (non-existent in Java)

Other languages like C++ have reference parameters for functions which provide the ability to modify a variable argument. If this existed in Java, we could write the recursive helper like this:
boolean add(Node n, E elt);
which is called at the root as follows:
boolean found = add(root, elt);
and the recursive helper function would be written like this (this is not java!!!):
private boolean add(Node & n, E elt)   // n is a "reference" parameter 
{
  if (n == null) {
    n = new Node(elt, null, null);  // reset n's value
    return false;
  }
  int comp = myCompare(elt, n.data);
  if (comp < 0) {
    return add(n.left, elt);
  } 
  else if (comp > 0) {
    return add(n.right, elt);
  } 
  else {
    return true;
  }
}
We cannot proceed in this way because Java is a simple language when it comes to parameter passing: there is only one way to pass a parameter. Despite the drawbacks of this limitation, the up side is that you will avoid the many potential mistakes which come along with parameter-passing flexibility. For example, in C++, there are 5 viable ways to pass an object as a parameter, and it's quite tricky to do the right thing unless you're an expert.

Simulate a reference parameter

The Java language only provides pass-by-value, meaning that the function only ever works with a copy of the variable, never permitting a change to the original variable's value. If we want a function to have an effect on a variable, we must, in effect, simulate a reference parameter. Our choices are these:
  1. Pass a variable to the function (i.e., pass its value); return an updated value and reset the variable from the return value. The call will look like this:
    SomeType some_var = foo( some_var, ... );
    
    The variable some_var passes its value into the function, the function computes and returns a new value, and then the assignment statement resets some_var to the new value.
  2. Pass an object containing the variable as member data and reset the value using permitted member operations.
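
Option 2 can be sketched with a minimal one-field holder class. BoolHolder and setTrue are hypothetical names used only for this illustration.

```java
// Sketch: pass a mutable holder object so the callee can "write back"
// a result, simulating a reference parameter.
public class HolderDemo {
    // A minimal one-field holder (hypothetical helper, not from the project).
    static class BoolHolder {
        boolean value;
    }

    // The callee sets the holder's field; the caller observes the change,
    // because both hold a reference to the same object.
    static void setTrue(BoolHolder h) {
        h.value = true;
    }

    public static void main(String[] args) {
        BoolHolder found = new BoolHolder();
        setTrue(found);
        System.out.println(found.value);  // true
    }
}
```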

Java-valid binary search tree add

At first, we want to ignore the issue of reporting whether the element was found or not and concentrate on simulating the Node reference parameter. We write (as in the textbook):
private Node<E> add(Node<E> n, E elt) {
  if (n == null) {
    return new Node<>(elt, null, null);
  }
 
  int comp = myCompare(elt, n.data);
 
  if (comp < 0) {                 // go left
    n.left = add(n.left, elt);    // modify subtree on left
    return n;
  }
  else if (comp > 0) {            // go right
    n.right = add(n.right, elt);  // modify subtree on right
    return n;
  }
  else {         // equal, do not add
    return n;    // no changes to n
  }
}
Duplicate elements are not added, but we currently have no report of whether the element was found or not. Given this version of add, the initial call within the public add function should be:
root = add(root, elt);
You can see this style repeated in each of the recursive calls:
n.left = add(n.left, elt);
and
n.right = add(n.right, elt);
The proof that this Java-based algorithm works is by induction. Assuming root is the top of a binary search tree, the claim is that the statement
n = add(n, elt);
correctly inserts elt into the search tree rooted at Node n
if n == null, clearly it works
if n != null, and elt is found at n, it works also
otherwise, the algorithm correctly calls one of these two:
n.left = add(n.left,elt)    or     n.right = add(n.right,elt)
which, by the inductive assumption, must work, and in that case n itself is not changed.

Determining the add return value using contains

Java's public add function needs to know whether the private add function actually added the element or not, because the public function must return true exactly when the set changed. The discovery of whether the element to be added already exists can be done by an initial pass down the tree using the contains function, with add written as follows:
public boolean add(E elt) {
  if (this.contains(elt)) {
    return false;
  }
  root = add(root, elt);
  ++size;
  return true;  // successful add if not found
}
 
private Node<E> add(Node<E> n, E elt) {
  if (n == null) {
    return new Node<>(elt, null, null);
  }
 
  int comp = myCompare(elt, n.data);
 
  if (comp < 0) {                 // go left
    n.left = add(n.left, elt);    // modify subtree on left
    return n;
  }
  else if (comp > 0) {            // go right
    n.right = add(n.right, elt);  // modify subtree on right
    return n;
  }
  // comp == 0 cannot occur here, since contains was called first;
  // this return exists only to satisfy the compiler
  return n;
}
Of course, doing so creates a doubling of execution time whenever a new element is added (which is likely to be most of the time).
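
One alternative, not used by the textbook, avoids the second pass entirely by threading a one-element boolean array through the recursion as a holder. This is a stand-alone sketch with int elements, not the project's code.

```java
// One-pass add sketch: a boolean[] "holder" records whether a new node
// was created, so no separate contains pass is needed.
public class OnePassAdd {
    static class Node {
        int data;
        Node left, right;
        Node(int d) { data = d; }
    }

    Node root;
    int size;

    public boolean add(int elt) {
        boolean[] added = { false };
        root = add(root, elt, added);
        if (added[0]) ++size;
        return added[0];
    }

    private Node add(Node n, int elt, boolean[] added) {
        if (n == null) {
            added[0] = true;          // report success through the holder
            return new Node(elt);
        }
        if (elt < n.data)      n.left = add(n.left, elt, added);
        else if (elt > n.data) n.right = add(n.right, elt, added);
        // equal: duplicate, added[0] stays false
        return n;
    }

    public static void main(String[] args) {
        OnePassAdd t = new OnePassAdd();
        System.out.println(t.add(20));  // true
        System.out.println(t.add(15));  // true
        System.out.println(t.add(20));  // false (duplicate)
        System.out.println(t.size);     // 2
    }
}
```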

Timing Considerations

The add function, as well as contains, is considered to run in logarithmic, i.e. O(log(n)), time. This, of course, assumes that the tree is "well balanced" in some sense. A plain binary search tree does not guarantee balance; the java.util.TreeSet implementation uses a self-balancing (red-black) tree to guarantee logarithmic behavior.

What can specifically be said is this: the average time to do add or contains in a binary search tree built by random insertions is O(log(n)).
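
The caveat about balance is easy to demonstrate: inserting keys in sorted order into a plain binary search tree produces a degenerate, list-like chain. A small stand-alone sketch (class and method names are illustrative):

```java
// Sketch: sorted insertions into a plain (unbalanced) BST produce a
// right-leaning chain whose height grows linearly with n.
public class SkewDemo {
    static class Node {
        int data;
        Node left, right;
        Node(int d) { data = d; }
    }

    static Node add(Node n, int elt) {
        if (n == null) return new Node(elt);
        if (elt < n.data) n.left = add(n.left, elt);
        else if (elt > n.data) n.right = add(n.right, elt);
        return n;
    }

    // height of the empty tree is -1, of a single node is 0
    static int height(Node n) {
        if (n == null) return -1;
        return 1 + Math.max(height(n.left), height(n.right));
    }

    public static void main(String[] args) {
        Node root = null;
        for (int i = 1; i <= 100; i++) {
            root = add(root, i);  // sorted insertions: worst case
        }
        System.out.println(height(root));  // 99: height is n-1, not O(log(n))
    }
}
```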

Traversals / Displaying Structure

Obtaining information about the contents of a tree structure is usually accomplished by traversing the tree in one of several standard methods:
  1. preorder traversal (root, left, right):
     visit the root
     visit the left subtree recursively
     visit the right subtree recursively
    
  2. inorder traversal (left, root, right):
     visit the left subtree recursively
     visit the root
     visit the right subtree recursively
    
  3. postorder traversal (left, right, root):
     visit the left subtree recursively
     visit the right subtree recursively
     visit the root
    
  4. levelorder traversal (this is not recursive)
     visit the root
     visit the children of the root
     visit the children's children   
     ...
    
In the simplest form, each of the first three traversals is easy to implement as follows:
preorderOutput(root);

// using the function ...
void preorderOutput(Node n) {
  if (n == null) {
    return;
  }
  System.out.println(n.data);   // root
  preorderOutput(n.left);        // left
  preorderOutput(n.right);       // right
}
inorderOutput(root);

// using the function ...
void inorderOutput(Node n) {
  if (n == null) {
    return;
  }
  inorderOutput(n.left);        // left
  System.out.println(n.data);   // root
  inorderOutput(n.right);       // right
}
postorderOutput(root);

// using the function ...
void postorderOutput(Node n) {
  if (n == null) {
    return;
  }
  postorderOutput(n.left);        // left
  postorderOutput(n.right);       // right
  System.out.println(n.data);   // root
}
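
The fourth traversal, level-order, is naturally expressed with a FIFO queue rather than recursion. Here is a stand-alone sketch that returns the visit order as a string, in the spirit of the String-returning traversal functions used by the demo program; the class and method names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Level-order traversal sketch: visit nodes top-down, left to right.
public class LevelOrder {
    static class Node {
        int data;
        Node left, right;
        Node(int d, Node l, Node r) { data = d; left = l; right = r; }
    }

    static String levelorder(Node root) {
        StringBuilder sb = new StringBuilder();
        Queue<Node> queue = new ArrayDeque<>();
        if (root != null) queue.add(root);
        while (!queue.isEmpty()) {
            Node n = queue.remove();          // next node in level order
            sb.append(n.data).append(' ');
            if (n.left != null)  queue.add(n.left);   // enqueue children
            if (n.right != null) queue.add(n.right);  // for the next level
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The sample tree from this document, rooted at 20.
        Node root = new Node(20,
            new Node(15, new Node(5, null, null),
                         new Node(18, new Node(17, null, null), null)),
            new Node(30, new Node(25, new Node(22, null, null),
                                      new Node(27, null, null)), null));
        System.out.println(levelorder(root));  // 20 15 30 5 18 25 17 22 27
    }
}
```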
Consider the effect on the following binary search tree:
One can validate that using the above inorder algorithm would print out the elements in their correct comparable order (one per line):
inorder listing: 5 15 17 18 20 22 25 27 30
In order to reveal the structure of the tree you also want to add indentation corresponding to a node's level. To do this, modify the code as follows:
inorderOutput(root,0);

// using the function ...
void inorderOutput(Node n, int level) {
  if (n == null) {
    return;
  }
  inorderOutput(n.left, level+1);
  // print indentation corresponding to current level
  System.out.println(" ".repeat(3*level) + n.data);
  inorderOutput(n.right, level+1);
}
Assuming that "print indentation corresponding to current level" means print 3*level blanks, the output would be this:
      5
   15
          17
      18
20
          22
      25
          27
   30
This yields a pretty good correspondence to the actual tree, but it is easier to "see" the tree if one used a reverse-inorder printing corresponding to a "right, root, left" visitation:
reverseInorderOutput(root,0);

// using the function ...
void reverseInorderOutput(Node n, int level) {
  if (n == null) {
    return;
  }
  reverseInorderOutput(n.right, level+1);
  // print indentation corresponding to current level
  System.out.println(" ".repeat(3*level) + n.data);
  reverseInorderOutput(n.left, level+1);
}
Compare the output of this function with the actual tree:
   30
          27
      25
          22
20
      18
          17
   15
      5
Although generally not as useful for visualization, the preorder and postorder traversal methods are nevertheless very important in tree-processing algorithms. Here are their listings of our sample tree:
preorder listing:  20 15 5 18 17 30 25 22 27
postorder listing: 5 17 18 15 22 27 25 30 20

Test program

The following demo program creates a random binary search tree with up to 10 elements and then calls the implementation-dependent traverse/output functions
public String inorderOutput();
public String preorderOutput();
public String postorderOutput();
public String reverseInorderOutput();
Here is the program:
demo.SearchTreeTraversals  

Timing Considerations

For any of the traversals on a tree with N nodes, we can prove the following:
the number of function calls = 1 + 2*N
The proof is by induction. It is easy to verify for N=0 or N=1. Then for N ≥ 1 write:
N = 1+L+R
where
L = # nodes in left subtree
R = # nodes in right subtree
Considering the initial call to the node as 1, we can see that
total calls = 1 + total calls on left side + total calls on right side
By induction,
total calls on left = 1+2*L
total calls on right = 1+2*R
giving us:
total calls = 1 + (1+2*L) + (1+2*R) = 1 + 2*(1+L+R) = 1 + 2*N
In particular, we say that traversals are linear time, or O(n), in terms of the number, n, of nodes.

The remove function

Removing an element from a search tree, although tricky, is conceptually straight-forward with one (common) exception: removing the element at a node with two non-null children. In this case, the solution is to replace the node's value with either the maximum value in the left subtree or the minimum value in the right subtree, and then remove that value from the corresponding subtree. In either case the search tree's order structure is preserved.

removeMin and removeMax

The idea of removeMin from a given node is simple: navigate to the leftmost node, capture the value, and delete the node. Unfortunately, Java's lack of reference parameters causes problems. The first step is to understand the removal operation without capturing the value. To do removeMin from the right subtree of Node n, we would execute:
n.right = removeMin( n.right );
using the recursive function:
private Node<E> removeMin( Node<E> n ) {  // assumes n != null
  if (n.left == null) {
    return n.right;
  }
  else {
    n.left = removeMin( n.left ); 
    return n;
  }
}
Here is a depiction of how this would be used in the remove function. Suppose we do remove(20) and this calls removeMin on the right:
p.right = removeMin(p.right);

In order to capture the minimum value, we can, as is done in the textbook, call two operations, one to find the minimum value, a second to remove it. Here is the findMin function:
private E findMin(Node<E> n) { // assumes n != null
  if (n.left == null) {
    return n.data;
  } 
  else {
    return findMin(n.left);
  }
}
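
The remove helper below also relies on findMax and removeMax, the mirror images of the min versions, which navigate rightward instead. A stand-alone sketch (the Node class here is a minimal stand-in for the inner class used by SearchTreeSet):

```java
// Sketch of the symmetric max-side helpers: navigate to the rightmost node.
public class MaxHelpers {
    static class Node {
        int data;
        Node left, right;
        Node(int d, Node l, Node r) { data = d; left = l; right = r; }
    }

    static int findMax(Node n) {        // assumes n != null
        return (n.right == null) ? n.data : findMax(n.right);
    }

    static Node removeMax(Node n) {     // assumes n != null
        if (n.right == null) {
            return n.left;              // unlink the rightmost node
        }
        n.right = removeMax(n.right);
        return n;
    }

    public static void main(String[] args) {
        Node root = new Node(20, new Node(15, null, null),
                                 new Node(30, new Node(25, null, null), null));
        System.out.println(findMax(root));  // 30
        root = removeMax(root);
        System.out.println(findMax(root));  // 25
    }
}
```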

The private remove helper function

Like the add operation, remove needs to know whether the element was found or not in order to provide the correct return value. As with add, we can use the contains function to first discover whether the remove operation is necessary.

Like add, we navigate down from the root looking for an occurrence of elt. The base case is an empty tree, in which we do nothing. If the element is found, then the "hard" case is when the node has both children. At this point we can choose to do either removeMin on the right or removeMax on the left. There are two reasonable ways to proceed: always make the same choice, or choose randomly between the two. The textbook uses the former alternative and notes the potential "skewing" effect which can occur with certain add/remove operation sequences. The code we use effects the latter alternative.
private final Random flipACoin = new Random();
 
private Node<E> remove(Node<E> n, E elt) {
  if (n == null) { // elt not in tree: you should never get here if you
    return null;   // call contains prior to using this helper function
  }
  int comp = myCompare(elt, n.data);
  if (comp < 0) {
    n.left = remove(n.left, elt);  // go left recursively
    return n;                      // n is unchanged
  }
  if (comp > 0) {                
    n.right = remove(n.right, elt); // go right recursively
    return n;                       // n is unchanged
  } 
  // else, comp == 0, Node n contains elt
  if (n.left == null) {
    return n.right;  // remove the node containing elt
  }
  if (n.right == null) {
    return n.left;   // remove the node containing elt
  } 
  // the node containing elt has 2 non-null children
  boolean choose_min = (flipACoin.nextInt(2) == 1); 
  if (choose_min) {
    n.data = findMin(n.right);  // replace n.data with value on right
    n.right = removeMin(n.right);  // remove node with value on right
  } 
  else {
    n.data = findMax(n.left);  // replace n.data with value on left
    n.left = removeMax(n.left);   // remove node with value on left
  }
  return n;  // n, now holding the replacement value, remains in the tree
}
The textbook points out in section 4.3.5 that always choosing, say, "removeMin on the right" can cause a "left-skewing" effect. The following diagram indicates the outcome of O(n²) deletions from a random tree created from O(n) initial insertions:
Figure 4.27 from
Weiss textbook
Other alternatives might be to try to make a "best" choice between removeMin or removeMax based on, say, the height of the left and right subtrees or the number of nodes within; however, to make such a modification efficient you would have to hold this extra information in a field in the Node class and operations would have to maintain this field correctly.

The public remove operation

The public remove member function uses the following code:
public boolean remove(Object obj) {
  E elt = (E) obj;
  if (this.contains(elt)) {
    root = remove(root, elt);
    --size;
    return true;  // successful remove if found
  }
  return false;
}
The Java prototype of remove(Object obj) forces us to cast obj to the generic type E. Because generics are erased at runtime, the cast itself is unchecked; a ClassCastException surfaces when the comparison is applied to an obj of an incompatible type. Note that the test "obj instanceof E" is not legal Java, again because of erasure; Java's java.util.TreeSet implementation simply lets the exception be thrown.

Note the semantic difference between the return value for add and remove in that finding the element returns false for add and true for remove.

Timing Considerations

The remove, removeMax and removeMin functions, like add and contains can all be done with one or two passes down the tree, and thus are also considered to be logarithmic time.

Right and wrong way to do removal

What makes removal complex is how a value is removed at a node with two non-null children. In this case, we do not actually remove the node containing the value. Consider our example tree:
The two correct removals of 20 would be either one of these:
We could, in theory, remove the node with 20, and then take the right side and "attach it" to the left side, or vice-versa, thus getting the following two removals of 20:
Even though this procedure effects a systematic removal of 20, the operation has increased the tree height, making subsequent operations more expensive.

Test program

The following program, SearchTreeRemovals, creates a "random" tree of integers and removes a randomly chosen element.
demo.SearchTreeRemovals  


© Robert M. Kline