9. Hash Sets

The SetMap Application

The examples in this and other documents in this section are all part of the Java Application SetMap. Download the source archive SetMap.zip and install it as a Java Application with Existing Sources. See the Using NetBeans document for details.

Like others, the SetMap project has multiple Main Classes intended to illustrate various independent features. The simplest way to run the various classes is to right-click the file, either in the editor or in the Projects window listing, and select Run File.

The Java files from the project which are referenced in this document are:
adapter/
  SetAdapter
util/
  OpenHashSet
demo/
  ShowHashCodes
  ShowSets
  CompareSets
  OpenHashSetDemo
  OpenHashRemove

Hashing concepts

The naïve idea of hashing is to store an element into an array at an index position computed from the element's hash code, typically hash(element) mod the array capacity. The obvious issue is that there may be two or more elements which map to the same index; if so, we say there is a collision.

A hash function which produces no collisions for a set of elements is referred to as a perfect hash function for that set. If we had a perfect hash function, we could simply store each element at its hash index in the array, leaving the other array positions empty. The cost of searching for an element would then be no more than the cost of computing its hash code.

Here is a depiction of a "perfect" hash table of size 11. Imagine the following set of strings and their computed hash values:
element       hash(element)
-------       -------------
"beer"            5
"afterlife"       9
"wisdom"          4
"politics"        10
"schools"         1
"fear"            3
The hash table would look like this:
0  
1 "schools"
2  
3 "fear"
4 "wisdom"
5 "beer"
6  
7  
8  
9 "afterlife"
10 "politics"
The appearance of the following element would cause a collision:
element       hash(element)
-------       -------------
"painting"        5

Java's hashCode function

In order to do hashing, we must have access to, or be able to create, a suitable hash code for any element. In a generic situation, this means that the element type must be "hash-able". Fortunately, Java has a built-in, Object-level member function called hashCode which generates such a hash code for any element. You can try this simple example to view the hashCode values of various elements:
for ( Object obj : 
  new Object[] { "a", "A", new Integer(10), "a0", "a1", new Float(10.0) } ) 
{
  System.out.println(obj.hashCode());
}
The output is:
97
65
10
3055
3056
1092616192
For a single character, the hashCode is its ASCII value; for the Integer, it is the value itself; otherwise some function combines the object's data bits to create the hashCode. For example, it appears that the string "a0" hashes according to the following computation:
ascii_code("a")*31 + ascii_code("0") = 97*31 + 48 = 3055
An important notion of hashing is that the same element will generate the same hash code every time the program is executed on any kind of system running Java, for now and forever in the future. Amen!

Hashing is really not constant time, but we say it is

If we have no collisions, the search for a given string simply involves computing the hash index and looking in the table at that index position. Based on this idea, it is usually stated that the cost of a perfect hash-based search is O(1), or constant time, regardless of the number of elements.

However, the constant time claim is not technically correct. The reason is that in order to have N distinct elements, the elements must be constructed from a finite set of symbols, and so some of them must be sequences of length Ω(log(N)). The hash function must process the entire symbol sequence in order to avoid duplicating hash codes, thus forcing an Ω(log(N)) lower bound on computing the hash code.

Despite this technicality, hash computation time (at least for perfect hashing) is said to be O(1), or constant time.

Open hashing, chaining

The key issue in hashing is dealing with, or resolving, collisions. The simplest idea of resolving collisions is to group them together as a set of colliding elements accessed through the array at the hash index. The technique called open hashing (as opposed to closed) maintains the set of colliding elements in a separate data structure called a bucket accessed through the array index.

A hash search procedure then becomes: compute the hash index of the element, then search the bucket at that index. If all goes well with respect to the hashing operation, the buckets will be of a small size. The easiest way to manage them is with a simple linked list. Open hashing managed in this way is called chaining.

Consider this code segment:
int capacity = 11;
String[] strings = {"music", "beer", "afterlife", "wisdom",
                    "politics", "theater", "schools", "painting", "fear"};
for (String s : strings) {
  System.out.println(
    '"' + s + '"' 
        + "\n\thash code: " + s.hashCode() 
        + "\n\tarray index: " + Math.abs(s.hashCode() % capacity)
  );
}
The output is for an array of capacity 11 with 9 string elements. The following depicts the execution of this code and the appearance of the desired hash table:
"music"
    hash code: 104263205
    array index: 2
"beer"
    hash code: 3019824
    array index: 5
"afterlife"
    hash code: 1019963096
    array index: 9
"wisdom"
    hash code: -787603007
    array index: 4
"politics"
    hash code: 547400545
    array index: 10
"theater"
    hash code: -1350043631
    array index: 2
"schools"
    hash code: 1917457279
    array index: 1
"painting"
    hash code: 925981380
    array index: 5
"fear"
    hash code: 3138864
    array index: 3
In particular we see that, with 9 strings and 11 indices, we have only two collisions, making the buckets at indices 2 and 5 each of size 2. This gives us the right sense of how well hashing can perform. Note that the order within the chains is not the entry order; hashing per se makes no attempt to keep track of the order of elements in any way.

Array capacity

Our example uses an array of capacity 11. The textbook chooses a default capacity of 101. Why 11 and 101, and not 10 and 100? The reason is that we want the capacity to be a prime number. What is important about being prime? It makes the mod operation less "predictable".

For example, suppose we had a table whose capacity was even and we want to put these Integer keys (whose .hashCode is the value) into the table:
key = 39934, 9656, 73232, 8856, 2342342, 3003338
The target index in each case would be:
index = key % capacity
What can we say about index? It is always even. Thus an even capacity makes the resultant indices "predictable", so we should not allow an even capacity. The same can be said for a capacity which is a multiple of some other small value. The rule of thumb is that the capacity should be either a prime or else the product of "large" primes.
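To make the "predictability" concrete, here is a small check (a sketch, not part of the SetMap project) using the even keys above and an even capacity of 10:
int capacity = 10;                          // an even capacity: a poor choice
int[] keys = {39934, 9656, 73232, 8856, 2342342, 3003338};
for (int key : keys) {
  System.out.print(key % capacity + " ");   // prints 4 6 2 6 2 8 -- only even indices
}
System.out.println();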

Demo Programs

Initially there are three demo programs to test:
  1. ShowHashCodes: shows the hash codes and index values for various elements
    demo.ShowHashCodes  
  2. ShowSets: Illustrates TreeSet, HashSet, and LinkedHashSet in terms of retrieval of stored elements.
    demo.ShowSets  
  3. CompareSets: Creates a sorted list of unique random integers from a list possibly containing duplicates. This program makes a timing comparison between:
    • Entering numbers into a HashSet, transferring the set to a list, and sorting the list.
    • Entering numbers into a TreeSet, transferring to a list (since it's already sorted).
    Our code does this operation twice and the second time is the one that "counts."
    demo.CompareSets  

Hash table load and rehashing

Assuming that the hash code computation is constant time, the lookup time now involves comparisons against elements in the collision chain, and so it is of critical importance to somehow ensure that the buckets remain small.

The crucial measure of performance of a hash table is the load, which is simply:
load = (# elements in the hash table) / capacity
If the load becomes too large, the buckets, obviously, can become too large and degrade the search time. For example, the hash table in the above example has a load of
9/11 = .82
The hash table implementation must establish a maximum allowable load value, which Java refers to as the load factor. According to the Java documentation, a load factor of .75 is recommended. The table in our previous example has a load (.82) greater than this recommended value.

Rehash

When the load becomes greater than the allowable load factor, a rehash operation is required. The operation is similar to what is done with the ArrayList class in that we put the same content into a larger array. It goes like this:
  1. Find a suitable larger capacity ≥ twice the previous capacity. We will assume the capacity is always a prime number, meaning we have to find a prime number bigger than the current value of 2 * capacity.
  2. Allocate a new array using the increased capacity value.
  3. Traverse the old hash table. For each data element, compute a new index value (the hash code is the same, but the target index, subject to mod by a new capacity value, will most likely have changed), then reinsert the element into the new table.
As in the case of the ArrayList, we want to ensure that the average cost of adding an element is still O(1). This will be true so long as we at least double the capacity with each rehash. Rehashing our above example goes from capacity 11 to 23 with these changes:
==> capacity: 11
==> size: 9
==> load: 0.82
1: schools
2: theater, music
3: fear
4: wisdom
5: painting, beer
9: afterlife
10: politics
==> capacity: 23
==> size: 9
==> load: 0.39
0: painting, wisdom
4: theater
8: fear
13: afterlife
16: politics, beer
18: schools
19: music
The only feature that is significantly different from the situation of the ArrayList is that the capacities must be primes, an issue which must be explored a bit.

Finding the next prime

The Java function used by our implementation to calculate the next larger prime is primeUp(n). The idea is to use a nested loop: the outer loop examines the odd numbers ≥ n, and the inner loop tries to reject a candidate by finding an odd divisor between 3 and its square root; the first candidate which cannot be rejected is the prime we want. Here is the definition of the primeUp function:
private static int primeUp(int n) {
  int prime = n;
  if (prime % 2 == 0) {
    prime += 1;
  }
  // start the test loop with prime = first odd number up from n
  boolean found_prime;
  do {                       // outer loop, testing odd numbers ≥ n
    found_prime = true;
    // check for divisibility for all odd numbers between 3 and sqrt(prime)
    for (int i = 3; i <= Math.sqrt(prime); i += 2) 
    {
      if (prime % i == 0) {
        found_prime = false;
        prime += 2;          // divisor found, reject this one and
        break;               // try the next odd number up
      }
    }
  } 
  while (!found_prime);
 
  return prime;
}
In our implementations of hashed structures, the rehash operation will increase capacity with the call:
capacity = primeUp( 2 * capacity + 1 );
Assuming that this is done when
n = .75 * capacity
the question is, how many integer divisions are necessary to find the next prime up? Our analysis below indicates that the number of divisions necessary is o(n), i.e., asymptotically less than n. Since a rehash costs O(n) just to move the n elements into the new hash table, we conclude that the primeUp function adds no significant extra cost.
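As a quick sanity check (a sketch assuming the primeUp method defined above is accessible from the calling code), the successive capacities produced from our example capacity of 11 would be:
int capacity = 11;
for (int i = 0; i < 4; i++) {
  capacity = primeUp(2 * capacity + 1);   // next prime, at least double the capacity
  System.out.print(capacity + " ");       // prints 23 47 97 197
}
System.out.println();
In particular, the first rehash takes the capacity from 11 to 23, matching the rehash example above.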

primeUp is o(n) on average

Start by asking, how many outer loops will execute, i.e., how far, starting from
N = 2*capacity + 1
must we go to find the next prime? It is known that
For all N > 1, there is a prime between N and 2*N.
This would suggest that we might have to examine on the order of N candidates (about N/2 odd numbers) to find the next prime. With up to about sqrt(N) divisions per candidate, that would make up to
N * sqrt(N)/2 = O(N*sqrt(N))
divisions to find the next prime up from N.

Actually, it is much fewer. The result from the famous Prime Number Theorem states that
The number of primes ≤ x is approximately x/ln(x), using the natural logarithm ln(x) = log_e(x).
Think of it as saying that ln(x) is an upper bound on the average-size interval in which we will find a prime smaller than x.

So if we take x = 2*N as the bound below which we are guaranteed to find a prime, then the average-size interval in which we should find a prime smaller than 2*N is no more than
ln(x) = ln(2*N) = ln(2) + ln(N) < 1 + ln(N)
In particular, going up from N, we will, on the average, find a prime within this many outer loops:
1 + ln(N)
Each candidate tested is less than 2*N, and the divisors tried are the odd numbers up to the candidate's square root, so the inner loop executes fewer than this many iterations:
sqrt(2*N)/2 < sqrt(N)
So let
N = 2*capacity + 1 = where we start looking for a prime.
n = .75 * capacity = the number of elements in the hash table when we rehash.
Use the replacement
capacity = n/.75
and get:
# divisions
 = (# outer loops) * (# divisions per outer loop)
 ≤ ( 1 + ln(N) ) * sqrt(N)
 = (ln(2*capacity + 1) + 1) * sqrt(2*capacity + 1)
 = (ln(2*n/.75 + 1) + 1) * sqrt(2*n/.75 + 1)
 = O(log(n) * sqrt(n)) 
 = o(sqrt(n) * sqrt(n))   (since log(n) = o(sqrt(n))) 
 = o(n)
The term o(n), using "little-o", means that the time is sub-linear; that is, for every constant c > 0, no matter how small,
# divisions ≤ c * n
for all sufficiently large n.

HashSet Implementation

Our user-defined OpenHashSet class extends the SetAdapter class which was used as the basis of the SearchTree class in the TreeSets document.
adapter.SetAdapter  
Our implementation class is this:
util.OpenHashSet  
Run the following demo program:
demo.OpenHashSetDemo  

Basis

At the basis of the OpenHashSet are the Node class and data elements:
private static class Node<E> {
  E data;
  Node<E> next;
 
  Node(E data, Node<E> next) {
    this.data = data;
    this.next = next;
  }
}
private float loadFactor;  // default value .75
private int size = 0;
private Node<E>[] bucket;
 
private int hash(Object obj) {
  return Math.abs(obj.hashCode() % bucket.length);
}
private void rehash() {
  // increase table size, reinstall all elements
}
Both the capacity and the loadFactor can be set through the constructor:
public OpenHashSet(int capacity, float loadFactor) {
  capacity = primeUp(capacity); // we need a prime
  bucket = (Node<E>[]) new Node[capacity];  // unchecked cast: workaround for generic array creation
  this.loadFactor = loadFactor;
}
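The rehash body above is only a stub. Here is a minimal sketch of one way it might be written, using the primeUp and hash helpers already shown (the SetMap project's actual code may differ):
private void rehash() {
  Node<E>[] old = bucket;
  int newCapacity = primeUp(2 * old.length + 1);  // next prime, at least double
  bucket = (Node<E>[]) new Node[newCapacity];     // hash() now uses the new capacity
  for (Node<E> n : old) {                         // traverse every old chain
    while (n != null) {
      int index = hash(n.data);                   // recompute the target index
      bucket[index] = new Node<>(n.data, bucket[index]);  // "addFirst" into the new chain
      n = n.next;
    }
  }
}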

Dealing with null

Unlike TreeSets, Java HashSets and LinkedHashSets permit null to be in the set. The problem is that null must be treated as an exceptional case, because hash(null) cannot be computed and the equality tests will fail as written. For simplicity, our implementation disallows the null value by initially testing each operation with something like this:
if (obj == null) {
  throw new java.lang.IllegalArgumentException("null not allowed");
}
It is possible to allow null in the set using a dedicated boolean data member like this:
private boolean hasNull = false;
All operations (add, remove, contains) treat null as a special case and simply consult and/or modify this data member, i.e.,
contains(null):  
  return hasNull
add(null):       
  if (hasNull) return false; else { hasNull = true; return true; }
remove(null):    
  if (hasNull) { hasNull = false; return true; } else return false;

Contains

The key idea in hashing-based operations is that every operation begins, either explicitly or implicitly, with code like this:
int index = hash(obj);      // find the hash index of the element
Node<E> n = bucket[index];  // access the Node chain for that index
For example, the contains function (in iterative form) is:
public boolean contains(Object obj) {
  int index = hash(obj);
  Node<E> n = bucket[index];
  while (n != null) {
    if (n.data.equals(obj)) {
      return true;
    }
    n = n.next;
  }
  return false;
}

Add

The add operation can rely on contains. We have to deal with the possibility of doing a rehash if the new size would push the load beyond the allowable loadFactor, in which case the target index must be computed against the new capacity (which is why the code computes index after the possible rehash). When an element is added, it is irrelevant where it goes within the node chain, and so our implementation effects an "addFirst" operation on the node chain.
public boolean add(E elt) {
  if (contains(elt)) {
    return false;
  }
  ++size;
  if ( (float)size/bucket.length > loadFactor ) {
    rehash();
  }
  int index = hash(elt);
  bucket[index] = new Node<>(elt, bucket[index]);
  return true;
}

Remove

For the remove operation, we again test for containment: if the set does not contain the element we return false; otherwise we remove the element from the node chain.

The slickest way to do the removal is to use a private recursive helper function written in the style used for operations in SearchTreeSet; i.e., the head of the node chain is treated as a reference parameter.
public boolean remove(Object obj) {
  if (!contains(obj)) { 
    return false;
  }
  --size;
  int num = hash(obj);
  bucket[num] = remove(bucket[num], obj);
  return true;
}
private Node<E> remove(Node<E> n, Object obj) {
  if (n.data.equals(obj)) {
    return n.next;
  } 
  else {
    n.next = remove(n.next, obj);
    return n;
  }
}
An alternative is this iterative solution in which we attempt to access the Node prior to the one containing the object in the chain, and then re-route around it. The exception to this situation is when the first Node is the one containing the object, and so there is no Node prior to it.
public boolean remove(Object obj) {
  if (!contains(obj)) { 
    return false;
  }
  --size;
  int num = hash(obj);
  Node<E> n = bucket[num];
  if (n.data.equals(obj)) {
    bucket[num] = n.next;
  }
  else {
    while (! n.next.data.equals(obj)) {
      n = n.next;
    }
    n.next = n.next.next;  // reroute around
  }
  return true;
}

Test Program

Run the following demo program:
demo.OpenHashRemove  
Observe how we force the example hash table behavior by using the constructor call with all parameters:
OpenHashSet<String> set = new OpenHashSet<>(11, .9F);

LinkedHashSet

The purpose of the LinkedHashSet object, as we have seen, is to keep track of the entry order of elements while retaining its O(1) time for the lookup-based operations. The way to achieve this effect is to maintain, in addition to the hash table, a (doubly) linked list to hold the actual elements.

The contains, add, and remove operations use the hash table and so are O(1) (constant-time) operations. The major difference between HashSet and LinkedHashSet is how the iterator (and thereby print and for-loop accesses) works. With a HashSet, the iterator delivers the elements in no particular order, say, down the array and across the Node chains. In contrast, the LinkedHashSet uses the built-in linked list to iterate over the elements.
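To see the difference concretely with the built-in java.util classes (the ShowSets demo does something similar; the class name OrderDemo here is just for illustration):
import java.util.*;

public class OrderDemo {
  public static void main(String[] args) {
    String[] strings = {"music", "beer", "afterlife", "wisdom",
                        "politics", "theater", "schools", "painting", "fear"};
    Set<String> hashed = new HashSet<>(Arrays.asList(strings));
    Set<String> linked = new LinkedHashSet<>(Arrays.asList(strings));
    System.out.println(hashed);   // some hash-determined order
    System.out.println(linked);   // entry order: [music, beer, afterlife, ...]
  }
}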

As an example, we'll revisit the hash entries into a table of capacity 11:
element       index
-------       -----
"music"         2
"beer"          5
"afterlife"     9
"wisdom"        4
"politics"      10
"theater"       2
"schools"       1
"painting"      5
"fear"          3
The underlying structure adds a doubly-linked list of entries alongside the hash table. What is being suggested are the following additions/modifications to the node structure:
private static class ListNode<E> {
  E data;
  ListNode<E> next;
  ListNode<E> prev;
  ListNode(E data, ListNode<E> next, ListNode<E> prev) {
    this.data = data;
    this.next = next;
    this.prev = prev;
  }
}
 
private static class Node<E> {
  ListNode<E> ptr;
  Node<E> next;
  Node(ListNode<E> ptr, Node<E> next) {
    this.ptr = ptr;
    this.next = next;
  }
}
 
private ListNode<E> first, last;
In particular, the newly created ListNode inner class is used for the doubly-linked list which maintains the entry order of elements. Within the hash table, the Node class has been modified so that its ptr field refers to the ListNode which contains the actual data; for example, the Node for "schools" in bucket 1 points to the "schools" ListNode in the entry-order list.
Without going into details, here is an outline of the operations:
  1. contains(obj):
    while moving through the pointer chain in the hash table, the test for equality now becomes this:
    if (n.ptr.data.equals(obj))
  2. rehash:
    Simply traverse the linked list, adding each node p into the newly created hash table according to the index hash(p.data).
  3. add(elt):
    After testing contains, do an addLast(elt) on the linked list, retrieving a pointer p to the added ListNode; then:
    bucket[index] = new Node(p, bucket[index]);
  4. remove(obj):
    After testing contains, obtain the ListNode pointer p for obj in the linked list. Using p.next and p.prev we can reroute the linked list around p, thereby removing it. A useful helper function is this modification of contains (a removal sketch using it appears after this list):
    private Node<E> search(Object obj) {
      int index = hash(obj);
      Node<E> n = bucket[index];
      while (n != null) {
        if ( n.ptr.data.equals(obj) ) { 
          break;
        } 
        else {
          n = n.next;
        }
      }
      return n;
    }
    Invoking Node<E> n = search(obj) indicates whether contains should be true by testing n != null; furthermore, ListNode<E> p = n.ptr becomes the ListNode we want to remove from the linked list.
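To tie these pieces together, here is a minimal sketch of remove for the linked version. It reuses the search helper above, assumes the same bucket, size, and hash members as in OpenHashSet, and introduces a hypothetical removeFromChain helper patterned after the recursive remove shown earlier:
public boolean remove(Object obj) {
  Node<E> n = search(obj);
  if (n == null) {
    return false;                       // not present
  }
  ListNode<E> p = n.ptr;                // the list node holding the data
  // reroute the doubly-linked entry-order list around p
  if (p.prev == null) { first = p.next; } else { p.prev.next = p.next; }
  if (p.next == null) { last = p.prev; } else { p.next.prev = p.prev; }
  // remove n from its hash chain, in the style of the recursive remove helper
  int index = hash(obj);
  bucket[index] = removeFromChain(bucket[index], obj);
  --size;
  return true;
}
private Node<E> removeFromChain(Node<E> n, Object obj) {
  if (n.ptr.data.equals(obj)) {
    return n.next;
  }
  n.next = removeFromChain(n.next, obj);
  return n;
}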


© Robert M. Kline