HashSets + HashMaps
— print (last updated: Oct 26, 2009) print

Select font size:
The sample classes and programs are meant be be installed into the newly NetBeans project HashDemo. The Main class will initially require
import java.util.*;

Hashing discussion

The naïve idea of hashing is to store each element into an array where the array index gotten by: If two or more elements map to the same index, we say there is a collision. If there are no collisions for a set of elements it is referred to as perfect hashing. In the circumstance of perfect hashing we could simply store the element at that index position in the array and the "cost" of searching for that element would be merely the cost of computing the hash code.

It is usually stated that the cost of a perfect hash-based lookup is constant time, O(1), with respect to the number, N, of elements. This claim, however, is not technically correct. The reason is that in order to have the ability to create N distinct elements, each element must be constructed from a finite set of symbols using a sequence of length Ω(log(N)). The hash function would have to process the entire symbol sequence in order to avoid duplicating hash codes, thus forcing an Ω(log(N)) lower bound for computing the hash code.

Java's hashCode function

In order to do hashing, we must have access to, or be able to create a suitable hash code for any element. In a generic situation, this means that the element type must be "hashable". Fortunately, Java has a built-in, Object-level member function called hashCode which generates such a hash code for any element. You can try this simple example to view the hashCode values of various element:
for ( Object o : 
  new Object[] { "a", "A", new Integer(10), "a0", "a1", new Float(10.0) } ) 
{
  System.out.println(o.hashCode());
}
The output is:
97
65
10
3055
3056
1092616192
For a single character, the hashCode is the ASCII value, for the Integer, the value itself, otherwise some function combines the object's data bits to create the hashCode. An important notion of hashing is that it is replicable, i.e., the same element will generate the same hash code every time the program is executed on any kind of system running Java.

Open hashing, chaining

The key issue in hashing is dealing with, or resolving, collisions. The simplest idea of resolving collisions is to group them together as a set of colliding elements accessed through the array at the specific index. This collision set is often called a bucket. The technique called open hashing (as opposed to closed) maintains the bucket as an entity separate from the array. A hash search becomes this operation: If all goes well with respect to the hashing operation, the buckets will be of a small size and we can simply manage them as a linked list. Open hashing managed in this way is called chaining.

Assuming that the hash code function is constant time, the lookup time now involves comparisons against elements in the bucket. Since the bucket size is unpredictable, it is ususally said that the hash search has constant, or O(1), average time. This notion, however implies that the number of buckets must be able to grow as a function of the size.

Array capacity & Load factor

The textbook chooses an default capacity of 101. Why 101 and not 100? The reason is that 101 is a prime. What is important about being a prime? The reason is that it makes the mod operation less "predictable". For example, if we the array size were 100, and we had a set of elements all of whose hash codes were (possibly very large) even numbers; mod by 100 would remain even, meaning that we would narrow the range of target indices to half of the possible values.

The rule of thumb is that the array size (# of buckets) should be a prime or else the product of "large" primes. Consider this code segment:
int capacity = 11;
String[] strings = {"music", "beer", "afterlife", "wisdom",
                    "politics", "theater", "schools", "painting", "fear"};
for (String s : strings) {
  System.out.println(
    '"' + s + '"' 
        + "\n\thash code: " + s.hashCode() + ", "
        + "\n\tarray index: " + Math.abs(s.hashCode() % capacity)
  );
}
The output is:
array of size 11 with 9 strings
"music"
        hash code: 104263205, 
        array index: 2
"beer"
        hash code: 3019824, 
        array index: 5
"afterlife"
        hash code: 1019963096, 
        array index: 9
"wisdom"
        hash code: -787603007, 
        array index: 4
"politics"
        hash code: 547400545, 
        array index: 10
"theater"
        hash code: -1350043631, 
        array index: 2
"schools"
        hash code: 1917457279, 
        array index: 1
"painting"
        hash code: 925981380, 
        array index: 5
"fear"
        hash code: 3138864, 
        array index: 3
In particular we see that with 9 strings out of 11 indices we have only two collisions making the buckets at indices 2 and 5 each of size 2. This gives us the right sense about how well hashing can perform. A crucial measure to keep track of is the load, which is:
load = (# elements in the hash table) / capacity
For example, this hash table has load
9/11 = .82

Rehashing

When the number of elements becomes "too large" with respect to the capacity, the buckets, obviously, can become too large and thereby cause a degradation in the search time. In practice, the code writer must establish a maximum allowable load, which Java refers to as the load factor. The Java documentation recommends a load factor of .75. For example, the table in our previous example has a load greater than the Java-acceptable load factor.

When the load becomes greater than the predetermined load factor, an O(N) operation of rehashing is required. The effect is this:
  1. create a bigger table: make the new capacity a prime larger than double the original capacity
  2. traverse the old table and, one by one, compute new index values (the hash codes are the same, but the target indices will change) and reinsert each element into the new table

Demo Programs

Do the following to create demos for trees: This section illustrates two "main" programs:
  1. Compare TreeSet, HashSet, and LinkedHashSet in terms of retrieval of stored elements.
  2. Create a sorted list of unique random integers from a list possibly containing duplicates. Do so by inserting from the original array into a set and transferring the set contents to a list. Compare times using both the TreeSet and HashSet implementations.

mainShowSets

import java.util.*;

public class Main {
  public static void main(String[] args) {
    mainShowSets(args);
  }

  public static void mainShowSets(String[] args) {
    Integer[] I = new Integer[20];
    Random r = new Random();
    for (int n = 0; n < I.length; ++n) {
      I[n] = r.nextInt(50);
    }
    System.out.println("I:  " + Arrays.toString(I));

    Set<Integer> HS = new HashSet<Integer>();
    Set<Integer> LS = new LinkedHashSet<Integer>();
    Set<Integer> TS = new TreeSet<Integer>();

    for (Integer n : I) {
      HS.add(n);
      LS.add(n);
      TS.add(n);
    }
    System.out.println("HS: " + HS);
    System.out.println("LS: " + LS);
    System.out.println("TS: " + TS);
  }
}

mainCompareTreeHashSets

The comparison is between: Our code does this operation twice and the second time is the one that "counts." We do this because each operation encounters initial latencies in the Java Virtual Machine that we want to factor out and try to obtain a more "steady state" behavior.
  public static void mainCompareTreeHashSets(String[] args) {
    long start_hash = 0;
    long start_sort = 0;
    long end_hash = 0;
    long start_tree = 0;
    long end_tree = 0;
    Set<Integer> HS = null;
    Set<Integer> TS = null;
    Integer[] I = null;

    final int NUM = 40000;

    for (int i = 0; i < 2; ++i) {
      I = new Integer[NUM];
      Random r = new Random();
      for (int n = 0; n < I.length; ++n) {
        I[n] = r.nextInt(NUM);
      }

      HS = new HashSet<Integer>();
      TS = new TreeSet<Integer>();

      List<Integer> HSL = new ArrayList<Integer>();
      List<Integer> TSL = new ArrayList<Integer>();

      start_hash = System.currentTimeMillis();

      for (Integer n : I) { // add the integers to the hash set, HS
        HS.add(n);
      }

      start_sort = System.currentTimeMillis();

      for (Integer n : HS) {  // transfer from the tree set to the list, HSL
        HSL.add(n);
      }
      Collections.sort(HSL);  // sort the list

      end_hash = System.currentTimeMillis();

      // Now, create the same effect using a TreeSet

      start_tree = System.currentTimeMillis();

      for (Integer n : I) {  // add the integers to the tree set, TS
        TS.add(n);
      }
      for (Integer n : TS) { // transfer from the tree set to the list, TSL
        TSL.add(n);
      }
      end_tree = System.currentTimeMillis();
    }

    System.out.println("total: " + I.length);
    System.out.println("total unique: " + TS.size());
    System.out.println("hash: " + (end_hash - start_hash) + "ms");
    System.out.println("\tenter: " + (start_sort - start_hash) + "ms");
    System.out.println("\tsort: " + (end_hash - start_sort) + "ms");
    System.out.println("tree: " + (end_tree - start_tree) + "ms");
  }

HashMap Examples

This program demonstrates the HashMap and LinkedHashMap classes.
  public static void mainRegularLinkedHashSets(String[] args) {
    Map<String, Integer> hmap = new HashMap<String, Integer>();
    Map<String, Integer> lhmap = new LinkedHashMap<String, Integer>();

    //Map<String, Integer> hmap = new OpenHMap<String, Integer>();
    //Map<String, Integer> lhmap = new LinkedOpenHMap<String, Integer>();

    String[] keys = {
      "fashion", "radio", "technology", "television",
      "movies", "business", "wine", "science"
    };

    int[] values = {5, 17, 7, 10, 8, 3, 14, 11};

    int index = 0;
    for (String key : keys) {
      int value = values[index++];
      System.out.println("entry: (" + key + "," + value + ")");
      hmap.put(key, value);
      lhmap.put(key, value);
    }

    System.out.println("hmap:  " + hmap);
    System.out.println("lhmap: " + lhmap);

    //((OpenHMap) hmap).tableOutput();
  }

HashMap Implementations

The HashMap and LinkedHashMap implement only the Map interface which we used before to build the SearchTreeMap class in TreeMaps + SearchTreeMap. Create the class:
Class Name: MapAdapter
package:    map
Do this by creating from the code
map.MapAdapter ( click to show )
or perhaps simply copy the entire map package from the TMapDemo project.

Open HashMap Implementation

The user-defined OpenHMap is an open-hashing implementation of the Map interface with only the most basic functions defined: size, get, put, toString. An additional operation
public void tableOutput()
can be used to reveal the internal hashed table structure.

Create the class:
Class Name: OpenHMap
package:    hash
Then insert the following content ( click to show )

Linked Open HashMap Implementation

Many of the members of OpenHMap are protected so that we can define our implementation LinkedOpenHMap as an extension of OpenHMap. These two stand in the same relation as in Java, i.e.,
class LinkedHashMap extends HashMap
What is added is a linked list of nodes whose data are (pointers to) the hash table nodes.

Create the class:
Class Name: LinkedOpenHMap
package:    hash
Then insert the following content ( click to show )

Prime calculation

The usage of prime numbers is an essential feature of hashing, and so we need to be able to obtain the "next prime number" bigger than some given value. This is the calculation:
  private int primeUp(int n) {
    int prime = n;
    if (prime % 2 == 0) {
      prime += 1;
    }
    // guess prime to be the first odd number up from n
    boolean found_prime;
    do {
      found_prime = true;
      // check for divisibility for all odd numbers between 3 and sqrt(prime)
      for (int i = 3; i <= Math.sqrt(prime); i += 2) {
        if (prime % i == 0) {
          found_prime = false;
          prime += 2;          // divisor found, try the next odd number up
          break;
        }
      }
    } while (!found_prime);
    return prime;
  }
This next prime computation is done quite efficiently but there is still the "double loop" to consider. When rehashing is required, we increase capacity by the call
    capacity = primeUp(2 * capacity + 1);
and so the call to primeUp uses n = 2*capacity + 1.

The first question is: how far from n to the next prime? Using the famous Prime Number Theorem we can expect the prime to be found after ln(n) outer loops.

The inner loop is at worst sqrt(n). Thus if the rehash is called when the hash size, N ~ .75 * capacity, we get an average case time of:
O(log(N)*sqrt(N)) = o(N)
Since a rehash costs O(N) just to move the nodes into the new hash table, the primeUp function adds no significant extra cost.

The internal structure

In the spirit of simplicity to create the chaining, we use a simple singly-linked list with this Node class:
protected class Node {
  K key;
  V value;
  Node next;
  Node(K key, V value, Node next) {
    this.key = key;
    this.value = value;
    this.next = next;
  }
  @Override
  public String toString() {
    return key + "=" + value;
  }
}
The toString function is needed only by the derived class in order to simplify its toString function.

The data elements are these:
protected int capacity;
private float loadFactor;
protected int size = 0;
protected Object[] bucket;   // array of Nodes
We would prefer to type the bucket array as:
protected Node[] bucket;
but Java refuses to make the necessary cast in the constructor:
bucket = (Node[]) new Object[capacity];
and so we are forced to do:
bucket = new Object[capacity];
In addition to these, the other crucial internal function is:
  protected int hash(K key) {
    return Math.abs(key.hashCode() % capacity);
  }

The put & get operations

The following internal put operation tacks new keys onto the head of the chain (a push operation). Although it seems like they should be added to the end of the chain, it really doesn't matter because the ordering of elements within the hash table is irrelevant.
  private Node findNode(K key, Node start) {
    Node curr;
    for (curr = start; curr != null; curr = curr.next) {
      if (curr.key.equals(key)) {
        break;
      }
    }
    return curr;
  }

  protected Node put(K key, V value, AtomicReference<V> saved) {
    int num = hash(key);
    Node curr = findNode(key, (Node) bucket[num]);
    
    if (curr == null) {  // create a new entry, return the new Node created

      float load = (float) size / capacity;

      if (load >= loadFactor) {
        rehash();
        num = hash(key);    // rehash this new one
      }

      // push the new node onto the bucket
      bucket[num] = new Node(key, value, (Node) bucket[num]);

      saved.set(null);           // no "old value", so set to null
      return (Node) bucket[num]; // return the new Node
    }
    else {                     // node with key found: no new entry created
      saved.set(curr.value);   // capture old value
      curr.value = value;      // reset the value
      return null;             // return null to indicate no new node
    }
  }
The public put function uses it like this:
  public V put(K key, V value) {
    AtomicReference<V> saved = new AtomicReference<V>();
    if (put(key, value, saved) != null) {
      ++size;
    }
    return saved.get();
  }
The public get operation is even simpler since there are no side-effects:
  public V get(Object obj) {
    K key = (K) obj;
    int num = hash(key);
    Node curr = findNode(key, (Node) bucket[num]);
    if (curr == null) {
      return null;
    } else {
      return curr.value;
    }
  }

Linked Open HashMap Implementation

As in Java, the class our implementation class is an extension of former class. The implementation class we have given ignores the issue of element removal. In that case, the only thing we need is an extra linked list to hold the hash Nodes as they are entered.

With the addition of the list, we are obliged to override the public put and toString functions, which, in this case must must the linked list for the ordering of elements. Here is the entire class repeated from above:
package hash;

import java.util.*;
import java.util.concurrent.atomic.AtomicReference;

public class LinkedOpenHMap<K, V> extends OpenHMap<K, V> {
  public LinkedOpenHMap() {
    super();
  }

  public LinkedOpenHMap(int capacity) {
    super(capacity);
  }

  public LinkedOpenHMap(int capacity, float loadFactor) {
    super(capacity, loadFactor);
  }

  private List<Node> list = new LinkedList<Node>();

  @Override
  public V put(K key, V value) {
    AtomicReference<V> saved = new AtomicReference<V>();

    Node insert = super.put(key, value, saved);
    if (insert != null) {
      ++size;
      list.add(insert);
    }
    return saved.get();
  }

  @Override
  public String toString() {
    return list.toString();
  }
}

Example

These are the key/value pairs used in the mainRegularLinkedHashSets function. The entry order is:
(fashion,5) (radio,17) (technology,7) (television,10)
(movies,8) (business,3) (wine,15) (science,11)
A depiction of the internal structure using our implementations would be this:

The OpenHMap, of course, would contain only the upper table and not the lower linked list.

Rehashing considerations

As explained above, the rehash operation requires that all the nodes be moved into a new table. This could conceivably be done in one of two ways:
  1. move the actual nodes and reestablish the linking in the new table
  2. move the (key,value) pairs into new nodes in the new table
We prefer the former because this hashing scheme can be more easily adapted for use in a linked hash map which store the hash node pointers in a linear list. Because the actual nodes are not changing, the linear list will still be valid after rehashing.

Element Removal

In a regular hash table removal is quite straight forward; simply remove the node from the linked list. In contrast, we see that the linked hash table depiction above is insufficient for our needs because the corresponding node must also be removed from the linked list. More linking is required to create the desired behavior.

In particular the hash Node class should also have a pointer to the node used in the linked list. In particular, we would be obliged to implement the linked list directly so that the list node could be stored in the hash nodes. A diagram of the modified linked hash map might look like this:

Open Hash Demo programs

The mainRegularLinkedHashSets function is already set up to run the demo. You need to The following program illustrates the triggering of the rehash operation. Keys are entered in two groups. The first group fits into the intial table with an ensured capacity of 17. The second group causes the load to go above the acceptable load factor and thereby force rehashing. You can run this in two ways, with map defined either as an OpenHMap or as an LinkedOpenHMap object.
  public static void mainShowRehash(String[] args) {
    Map<String,Integer> map = new OpenHMap<String,Integer>(17);
    //Map<String, Integer> map = new LinkedOpenHMap<String, Integer>(17);

    String[] keys = {
      "fashion", "radio", "technology", "television",
      "movies", "business", "wine", "science"
    };

    int[] values = {5, 17, 7, 10, 8, 3, 14, 11};

    System.out.println("key entry: " + Arrays.toString(keys));

    int index = 0;
    for (String key : keys) {
      int value = values[index++];
      map.put(key, value);
    }

    System.out.println("map: " + map);

    System.out.println("table output");
    ((OpenHMap) map).tableOutput();

    String[] more_keys = new String[]{
        "music", "beer", "afterlife", "wisdom",
        "politics", "theater", "schools", "painting"
      };

    System.out.println("key entry: " + Arrays.toString(more_keys));

    index = 0;
    for (String key : more_keys) {
      int value = values[index++];
      map.put(key, value);
    }

    System.out.println("map: " + map);

    System.out.println("table output");
    ((OpenHMap) map).tableOutput();
  }

Closed Hashing

A closed hashing implementation is one in which the elements stay in the array rather than being placed in an auxiliary collision set, such as a linked list. In order to achieve this, collisions must be resolved by "probing" a sequence of alternative locations. There are three standard alternatives for probing algorithms: linear, quadratic, double-hash.

Linear probing

The simplest idea is called linear probing. Upon collision, look for an alternative position in the sequence of indices following the hashed index.
hash(e) + 1
        + 2
        + 3
        + ... increase by adding 1, mod by capacity

Clustering Effects

There are two so-called clustering effects that come about from linear probing:
  1. primary clustering: when a key hashes into a probe path of other collisions, it follows the same path.
  2. secondary clustering: two or more colliding elements must follow the same probe path.
You may debate the choice of these names, but consider an the open hashing model. It does not have primary clustering, but it does have secondary clustering. So the term "primary" in some sense signifies "worse."

Quadratic probing

A technique which eliminates primary clustering is called quadratic probing in which the following sequence of alternative locations is explored on collision:
hash(e) + 1 
        + 4  = 22
        + 9  = 32
        + 16 = 42
        + ... increase by adding successive odd numbers, mod by capacity
One issue which must be resolved is whether we will be able to find an open slot with this method under reasonable circumstances. A relatively easy number-theoretic result ensures that if the capacity is prime and the load is under 50% that a the probe will succeed.

Double-hash probing

A technique which eliminates both primary and secondary clustering is called double-hasing. The idea is to compute a second hash value of the element to be inserted.
a = hash2(e)
We want 0 < a < capacity. Using this value, search this sequence of alternative locations:
hash(e) + a
        + 2 * a
        + 3 * a
        + ....   increase by adding a, mod by capacity
Because the "a" value is likely to be different for different elements, they probe different sequences of indices.

Implementation

Either create the class
set.SetAdapter ( click to show )
or perhaps simply copy the entire set package from the TMapDemo or TSetDemo projects.

Here is a class which implements closed hashing with the three probing versions. Choose a probing version by setting the private data member String whichProbe appropriately. Create the class:
Class Name: ClosedHSet
package:    hash
Then insert the following content ( click to show )

Like our other implementations the ClosedHSet class extends the basic set implementation, SetAdapter.

probe functions, add & contains

Each probe function returns -1 if the element to be is found and a (non-negative) array index if the element at the "end" of the probe sequence if the element is not found. Here are the probe function implementations:
  private int linear_probe_pos(E elt) {
    int pos = hash(elt);
    while (true) {
      if (data[pos] == null) {
        return pos;
      }
      else if (data[pos].equals(elt))
        return -1;
      else
        pos = (pos + 1) % capacity;
    }
  }
  
  private int quadratic_probe_pos(E elt) {
    int pos = hash(elt);
    int i = 1;
    while (true) {
      if (data[pos] == null) {
        return pos;
      } else if (data[pos].equals(elt)) {
        return -1;
      } else {
        pos = (pos + i) % capacity;
        i += 2;
      }
    }
  }

  // we could write a "primeDown" function and make
  // smaller_factor = primeDown(capacity), but we simplify it here
  // since this second hash value is less crucial than the first one.
  
  private int hash2(E elt) {
    int smaller_factor = capacity - 2;
    return 1 + Math.abs(elt.hashCode() % smaller_factor );
  }

  private int double_hash_probe_pos(E elt) {
    int pos = hash(elt);
    int amt = hash2(elt);
    while (true) {
      if (data[pos] == null) {
        return pos;
      } else if (data[pos].equals(elt)) {
        return -1;
      } else {
        pos = (pos + amt) % capacity;
      }
    }
  }
The add and contains functions calls a single probe function, probe_pos which is one of the three specific probe function implementations: linear_probe_pos, quadratic_probe_pos, or double_hash_probe_pos.
  public boolean add(E elt) {
    int pos = probe_pos(elt);
    if (pos == -1) {        // element found and not added
      return false;
    } else {                // element added at position pos
      ++size;
      data[pos] = elt;
      return true;
    }
  }

  public boolean contains(Object obj) {
    E elt = (E) obj;
    return probe_pos(elt) == -1;
  }

test program

The implementation-revealing tableDump member function prints the contents of non-empty array cells. If the hash code matches the position, an asterisk is printed; otherwise the hash code, and if relevant, the double-hash code are printed as well.

Here is a test program manufactured to illustrate the clustering effects and thereby compare the differences of each approach. It's assuming you already have the import:
import hash.*;
To make the comparison, run it three times using each of the three valid values of the private data member whichProbe.
  static void mainClosedHash(String[] args) {
    Set<String> set = new ClosedHSet<String>(30);

    String[] keys = {
      "games", "blogs", "cartoons", "music", "books", "health", "dining",
      "sunday", "store", "classified", "mobile", "update", "services"
    };

    for (String key : keys) {
      set.add(key);
    }

    ((ClosedHSet) set).tableOutput();
  }

Deletion in closed hashing

Deletion, or removal of elements is tricky in closed hashing because the removal of an element in a probe sequence would cause the search for elements further on in the probe sequence to be "lost."

Given this limitation, deletion is best done by the so-called lazy deletion method in which a deleted element is simply marked as removed. A special non-null object must be used for this purpose. For example, we choose an object of a special user-defined class Deleted so that deletion means:
data[delete_position] = new Deleted( ... );
This delete element can possibly contain useful information, like the deleted value, etc.

The probe functions must then be modified so as to: For example, for linear probing:
private int linear_probe_pos(E e) {
  int pos = hash(e);
  int delpos = -1;
  while (true) {
    if (data[pos] == null) {
      // reached the end of the probe path
      // if you found a delete position on the way, return that
      // for possible insertion
      return (delpos != -1) ? delpos : pos;
    }
    else if (data[pos].equals(e))
      // the element you're looking for was found
      return -1;
    else {
      // mark a deleted position if found
      if (data[pos] instanceof Deleted)  // a "deleted" marker
        delpos = pos;

      pos = (pos + 1) % capacity;
    }
  } 
}


© Robert M. Kline