The SetMap Application
The examples in this and the other documents in this section are all part of the Java application SetMap. Download the source archive SetMap.zip and install it as a Java Application with Existing Sources. See the Using NetBeans document for details. Like the other projects, the SetMap project has multiple Main Classes intended to illustrate various independent features. The simplest way to run the various classes is to right-click the file and select Run File, either in the file itself or from the listing in the Projects window. The Java files from the project which are referenced in this document are:

adapter/
  SetAdapter
util/
  OpenHashSet
demo/
  ShowHashCodes
  ShowSets
  CompareSets
  OpenHashSetDemo
  OpenHashRemove
Hashing concepts
The naïve idea of hashing is to store an element into an array at a position index computed as follows:

- obtain element_hash_code of the element by processing the element's data and generating an integer value (the idea of "hashing" the element roughly means "grinding it up")
- use a simple mod operation to map into the array's range:

index = hash(element) = abs(element_hash_code % array_capacity)

The element_hash_code can be negative, and so the absolute value is needed to ensure that we obtain a non-negative index.
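As a quick illustration (this snippet is not part of the project code), the string "wisdom" happens to have a negative hashCode, and the absolute value repairs the sign:

public class NegativeHashDemo {
  public static void main(String[] args) {
    int capacity = 11;
    int h = "wisdom".hashCode();                  // -787603007, negative
    System.out.println(h % capacity);             // -4: Java's % keeps the sign
    System.out.println(Math.abs(h % capacity));   //  4: a usable index
    // Caveat: Math.abs(Integer.MIN_VALUE) is still negative, so robust
    // code often masks the sign bit instead: (h & 0x7fffffff) % capacity.
  }
}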
Imagine the following set of strings and their computed hash values:

element      hash(element)
-------      -------------
"beer"            5
"afterlife"       9
"wisdom"          4
"politics"       10
"schools"         1
"fear"            3

The hash table would look like this: each string is stored in the array slot given by its hash value.

[figure: the hash table with the six strings placed at their computed indices]

Now suppose we want to add one more element:

element      hash(element)
-------      -------------
"painting"        5

Index 5 is already occupied by "beer": a collision.
Java's hashCode function
In order to do hashing, we must have access to, or be able to create, a suitable hash code for any element. In a generic situation, this means that the element type must be "hash-able". Fortunately, Java has a built-in, Object-level member function called hashCode which generates such a hash code for any element. You can try this simple example to view the hashCode values of various elements:

for ( Object obj : new Object[]{ "a", "A", new Integer(10), "a0", "a1", new Float(10.0) } ) {
  System.out.println(obj.hashCode());
}

The output is:

97
65
10
3055
3056
1092616192

For a single-character string, the hashCode is the ASCII value; for the Integer, the value itself; otherwise some function combines the object's data bits to create the hashCode. For example, it appears that the string "a0" hashes according to the following computation:

ascii_code("a")*31 + ascii_code("0") = 97*31 + 48 = 3055

An important notion of hashing is that the same element will generate the same hash code every time the program is executed on any kind of system running Java, for now and forever in the future. Amen!
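In fact, this agrees with the documented behavior of String.hashCode, which is the base-31 polynomial s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] over the characters. A quick check (not part of the project code):

public class HashFormulaCheck {
  public static void main(String[] args) {
    System.out.println('a' * 31 + '0');    // 97*31 + 48 = 3055
    System.out.println("a0".hashCode());   // 3055 -- they agree
  }
}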
Hashing is really not constant time, but we say it is
If we have no collisions, the search for a given string would simply involve computing the hash index and looking in the table at that index position. Based on this idea, it is usually stated that the cost of a perfect hash-based search is O(1), or constant time, regardless of the number of elements. However, the constant time claim is not technically correct. The reason is that in order to have the ability to create N distinct elements, each element must be constructed from a finite set of symbols using a sequence of length Ω(log(N)). The hash function would have to process the entire symbol sequence in order to avoid duplicating hash codes, thus forcing an Ω(log(N)) lower bound to compute the hash code. Despite this technicality, hash computation time (at least for perfect hashing) is said to be O(1), or constant time.

Open hashing, chaining
The key issue in hashing is dealing with, or resolving, collisions. The simplest idea of resolving collisions is to group them together as a set of colliding elements accessed through the array at the hash index. The technique called open hashing (as opposed to closed) maintains the set of colliding elements in a separate data structure called a bucket accessed through the array index. A hash search procedure becomes this operation:

- compute the hash index: index = abs(element.hashCode() % array_size)
- search the set bucket[index] for an occurrence of the element
For example, the following code computes the actual hash codes and array indices (capacity 11) of our sample strings:

int capacity = 11;
String[] strings = {"music", "beer", "afterlife", "wisdom",
                    "politics", "theater", "schools", "painting", "fear"};
for (String s : strings) {
  System.out.println(
      '"' + s + '"'
      + "\n\thash code: " + s.hashCode()
      + "\n\tarray index: " + Math.abs(s.hashCode() % capacity)
  );
}

The output:

"music"
    hash code: 104263205
    array index: 2
"beer"
    hash code: 3019824
    array index: 5
"afterlife"
    hash code: 1019963096
    array index: 9
"wisdom"
    hash code: -787603007
    array index: 4
"politics"
    hash code: 547400545
    array index: 10
"theater"
    hash code: -1350043631
    array index: 2
"schools"
    hash code: 1917457279
    array index: 1
"painting"
    hash code: 925981380
    array index: 5
"fear"
    hash code: 3138864
    array index: 3

[figure: the resulting hash table with collision chains at index 2 ("theater", "music") and index 5 ("painting", "beer")]
Array capacity
Our example uses an array of capacity 11. The textbook chooses a default capacity of 101. Why 11 and 101 and not 10 and 100? The reason is that we want the capacity to be a prime number. What is important about being a prime? It makes the mod operation less "predictable". For example, suppose we had a table whose capacity was even and we wanted to put these Integer keys (whose hashCode is the value itself) into the table:

key = 39934, 9656, 73232, 8856, 2342342, 3003338

The target index in each case would be:

index = key % capacity

What can we say about index? Since every key is even and the capacity is even, the index is always even. Thus the even capacity makes the resultant indices "predictable", leaving the odd-numbered slots unused. So we should not allow an even capacity. The same can be said for a capacity which is a multiple of some other small value. The rule of thumb is that the capacity should be either a prime or else the product of "large" primes.
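The effect is easy to demonstrate (this snippet is not part of the project code): with the even capacity 10, the all-even keys land only on even indices, while the prime capacity 11 spreads them over both parities:

public class CapacityDemo {
  public static void main(String[] args) {
    int[] keys = {39934, 9656, 73232, 8856, 2342342, 3003338};
    for (int capacity : new int[] {10, 11}) {
      System.out.print("capacity " + capacity + ":");
      for (int key : keys) {
        System.out.print(" " + key % capacity);   // all even when capacity is 10
      }
      System.out.println();
    }
  }
}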
Demo Programs
Initially there are three demo programs to test:

- ShowHashCodes: shows the hash codes and index values for various elements.
- ShowSets: illustrates TreeSet, HashSet, and LinkedHashSet in terms of retrieval of stored elements.
- CompareSets: creates a sorted list of unique random integers from a list possibly containing duplicates. This program makes the timing comparison between:
  - entering numbers into a HashSet, transferring the set to a list, and sorting the list;
  - entering numbers into a TreeSet and transferring to a list (since it's already sorted).
  A sketch of this comparison appears below.
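Here is a sketch of the kind of comparison CompareSets makes; the names and sizes used here are illustrative, not the actual project code:

import java.util.*;

public class CompareSetsSketch {
  public static void main(String[] args) {
    int n = 1_000_000;
    Random rand = new Random();
    List<Integer> numbers = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      numbers.add(rand.nextInt(n));   // duplicates are likely
    }

    // Version 1: HashSet removes duplicates, then we sort.
    long start = System.nanoTime();
    List<Integer> viaHash = new ArrayList<>(new HashSet<>(numbers));
    Collections.sort(viaHash);
    System.out.printf("HashSet + sort: %.3f sec%n", (System.nanoTime() - start) / 1e9);

    // Version 2: TreeSet keeps elements sorted as they are added.
    start = System.nanoTime();
    List<Integer> viaTree = new ArrayList<>(new TreeSet<>(numbers));
    System.out.printf("TreeSet:        %.3f sec%n", (System.nanoTime() - start) / 1e9);
  }
}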
Hash table load and rehashing
Assuming that the hash code computation is constant time, the lookup time now involves comparisons against elements in the collision chain, and so it is of critical importance to somehow ensure that the buckets remain small. The crucial measure of performance of a hash table is the load, which is simply:
load = (# elements in the hash table) / capacity
If the load becomes too large, the buckets, obviously, can become too large and cause degradation of the search time. For example, the hash table in the above example has a load of

9/11 = .82

The hash table implementation must establish a maximum allowable load value, which Java refers to as the load factor. According to the Java documentation, a load factor of .75 is recommended. For example, the table in our previous example has a load (.82) greater than the recommended value.
Rehash
When the load becomes greater than the allowable load factor, a rehash operation is required. The operation is similar to what is done with the ArrayList class in that we put the same content into a larger array. It goes like this:

- Find a suitable larger capacity ≥ twice the previous capacity. We will assume the capacity is always a prime number, meaning we have to find a prime number bigger than twice the current capacity.
- Allocate a new array using the increased capacity value.
- Traverse the old hash table. For each data element, compute a new index value (the hash code is the same, but the target index, subject to mod by a new capacity value, will most likely have changed), then reinsert the element into the new table.
The point of choosing the new capacity this way is to:

- at least double the capacity, and
- ensure that the cost of each rehash is O(n).
For example, here is the sample table before and after a rehash (note that 23 = primeUp(2*11 + 1)):

==> capacity: 11
==> size: 9
==> load: 0.82
 1: schools
 2: theater, music
 3: fear
 4: wisdom
 5: painting, beer
 9: afterlife
10: politics

==> capacity: 23
==> size: 9
==> load: 0.39
 0: painting, wisdom
 4: theater
 8: fear
13: afterlife
16: politics, beer
18: schools
19: music
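The rehash operation itself is left as a stub in the implementation below; here is a minimal sketch of what it might look like, assuming the Node class, bucket array, hash helper, and primeUp function shown elsewhere in this document:

private void rehash() {
  Node<E>[] old = bucket;
  int newCapacity = primeUp(2 * bucket.length + 1);   // next prime up
  bucket = (Node<E>[]) new Node[newCapacity];         // unchecked cast, as in the constructor
  // Traverse every chain of the old table and re-insert each element;
  // hash() now mods by the new, larger capacity.
  for (Node<E> chain : old) {
    for (Node<E> n = chain; n != null; n = n.next) {
      int index = hash(n.data);
      bucket[index] = new Node<>(n.data, bucket[index]);
    }
  }
}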
Finding the next prime
The Java function used by our implementation to calculate the next larger prime is the function primeUp(n). The idea is to use a nested loop where the outer loop examines all odd numbers ≥ n and the inner loop sees if we can reject a candidate by finding an odd divisor between 3 and its square root; the first number which cannot be rejected is the prime we want. Here is the definition of the primeUp function:

private static int primeUp(int n) {
  int prime = n;
  if (prime % 2 == 0) {
    prime += 1;
  }
  // start the test loop with prime = first odd number up from n
  boolean found_prime;
  do {                         // outer loop, testing odd numbers ≥ n
    found_prime = true;
    // check for divisibility by all odd numbers between 3 and sqrt(prime)
    for (int i = 3; i <= Math.sqrt(prime); i += 2) {
      if (prime % i == 0) {
        found_prime = false;
        prime += 2;            // divisor found, reject this one and
        break;                 // try the next odd number up
      }
    }
  } while (!found_prime);
  return prime;
}
When a rehash occurs, the new capacity is computed as:

capacity = primeUp( 2 * capacity + 1 );

Assuming that this is done when

n = .75 * capacity

the question is: how many integer divisions are necessary to find the next prime up? Our analysis below indicates that the number of divisions is o(n), i.e., sub-linear in n. Since a rehash costs O(n) just to move the n elements into the new hash table, we conclude that the primeUp function adds no significant extra cost.
primeUp is o(n) on the average
Start by asking: how many outer loops will execute, i.e., how far, starting from

N = 2*capacity + 1

must we go to find the next prime? It is known (Bertrand's postulate) that

For all N > 1, there is a prime between N and 2*N.

This would suggest that we might have to examine up to N numbers (N/2 odd candidates) to find the next prime. With up to roughly sqrt(N) divisions per testing loop, that would make up to

N * sqrt(N)/2 = O(N*sqrt(N))

divisions to find the next prime up from N. Actually, it is much fewer. The famous Prime Number Theorem states that

The number of primes ≤ x is approximately x/ln(x)

using the natural logarithm, ln(x) = log_e(x). Think of it as saying that ln(x) is an upper bound on the average-size interval in which we will find a prime smaller than x. So if we take x = 2*N as the upper bound within which we are guaranteed to find a prime, then the average size of the interval in which we should find a prime smaller than 2*N is no more than

ln(x) = ln(2*N) = ln(2) + ln(N) < 1 + ln(N)

In particular, going up from N, we will, on the average, find a prime within this many outer loops:

1 + ln(N)

Every candidate tested is less than 2*N, and each inner loop tries only odd divisors up to the candidate's square root, so each inner loop executes fewer than this many iterations:

sqrt(2*N)/2 < sqrt(N)

So let
N = 2*capacity + 1 = where we start looking for a prime.
n = .75 * capacity = the number of elements in the hash table when we rehash.
Use the replacement

capacity = n/.75

and get:
# divisions = (# outer loops) * (# divisions per outer loop)
            ≤ ( 1 + ln(N) ) * sqrt(N)
            = (ln(2*capacity + 1) + 1) * sqrt(2*capacity + 1)
            = (ln(2*n/.75 + 1) + 1) * sqrt(2*n/.75 + 1)
            = O(log(n) * sqrt(n))
            = o(sqrt(n) * sqrt(n))     (since log(n) = o(sqrt(n)))
            = o(n)

The term o(n), using "little-o", means that the time is sub-linear, or:
# divisions ≤ c * n

for every constant c > 0, no matter how small, once n is sufficiently large.
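A quick empirical check of this sub-linearity (not part of the project code) is to instrument primeUp with a division counter and compare the count against n:

public class PrimeUpCount {
  static long divisions = 0;

  static int primeUp(int n) {
    int prime = (n % 2 == 0) ? n + 1 : n;
    boolean foundPrime;
    do {
      foundPrime = true;
      for (int i = 3; i <= Math.sqrt(prime); i += 2) {
        divisions++;                   // one trial division
        if (prime % i == 0) {
          foundPrime = false;
          prime += 2;
          break;
        }
      }
    } while (!foundPrime);
    return prime;
  }

  public static void main(String[] args) {
    for (int n = 1_000; n <= 10_000_000; n *= 10) {
      divisions = 0;
      int capacity = (int) (n / .75);
      primeUp(2 * capacity + 1);       // the rehash-time call
      System.out.printf("n = %,d   divisions = %,d%n", n, divisions);
    }
  }
}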
HashSet Implementation
Our user-defined OpenHashSet class extends the SetAdapter class which was used as the basis of the SearchTree class in the TreeSets document.

Basis
At the basis of the OpenHashSet are the Node class and these data members:

private static class Node<E> {
  E data;
  Node<E> next;

  Node(E data, Node<E> next) {
    this.data = data;
    this.next = next;
  }
}

private float loadFactor;   // default value .75
private int size = 0;
private Node<E>[] bucket;

private int hash(Object obj) {
  return Math.abs(obj.hashCode() % bucket.length);
}

private void rehash() {
  // increase table size, reinstall all elements (see the sketch above)
}
public OpenHashSet(int capacity, float loadFactor) {
  capacity = primeUp(capacity);              // we need a prime
  bucket = (Node<E>[]) new Node[capacity];   // unchecked cast; note that
                                             // new Object[capacity] would fail this cast at runtime
  this.loadFactor = loadFactor;
}
Dealing with null
Unlike TreeSets, Java HashSets and LinkedHashSets permit null to be in the set. The problem is that null must be treated as an exceptional case, because hash(null) cannot be computed and the equality tests will fail as written. For simplicity, our implementation will disallow the null value by initially testing each operation with something like this:

if (obj == null) {
  throw new java.lang.IllegalArgumentException("null not allowed");
}

If we instead wanted to support null the way Java's hash sets do, we could maintain a boolean member:
private boolean hasNull = false;
and handle the null cases separately in each operation:

contains(null):  return hasNull

add(null):       if (hasNull) return false;
                 else { hasNull = true; return true; }

remove(null):    if (hasNull) { hasNull = false; return true; }
                 else return false;
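Here is a sketch (not the project's code) of how those cases could translate into Java helpers using the hasNull member above; whether null counts toward size, as it does in Java's HashSet, is a design choice noted in the comments:

private boolean containsNull() {
  return hasNull;
}

private boolean addNull() {
  if (hasNull) { return false; }   // null already present
  hasNull = true;
  ++size;                          // Java's HashSet counts null toward size
  return true;
}

private boolean removeNull() {
  if (!hasNull) { return false; }
  hasNull = false;
  --size;
  return true;
}

Each public operation would then begin with a dispatch such as: if (obj == null) { return containsNull(); }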
Contains
The key idea in hashing-based operations is that every operation begins, either explicitly or implicitly, with code like this:

int index = hash(obj);       // find the hash index of the element
Node<E> n = bucket[index];   // access the Node chain for that index
public boolean contains(Object obj) {
  int index = hash(obj);
  Node<E> n = bucket[index];
  while (n != null) {
    if (n.data.equals(obj)) {
      return true;
    }
    n = n.next;
  }
  return false;
}
Add
The add operation can rely on contains. We have to deal with the possibility of doing a rehash if the new size would push the load beyond the allowable loadFactor, in which case the index must be computed after the rehash (the hash code is unchanged, but the capacity is not). When an element is added, it is irrelevant where it is added within the node chain, and so our implementation effects an "addFirst" operation on the node chain.

public boolean add(E elt) {
  if (contains(elt)) {
    return false;
  }
  ++size;
  if ( (float) size / bucket.length > loadFactor ) {
    rehash();
  }
  int index = hash(elt);   // must come after the possible rehash
  bucket[index] = new Node<>(elt, bucket[index]);
  return true;
}
Remove
For the remove operation, we again test for containment; if the element is present, we remove it from the node chain. The slickest way to do the removal is to use a private recursive helper function written in the style used for operations in SearchTreeSet; i.e., the head of the node chain is treated as a reference parameter.

public boolean remove(Object obj) {
  if (!contains(obj)) {
    return false;
  }
  --size;
  int num = hash(obj);
  bucket[num] = remove(bucket[num], obj);
  return true;
}

private Node<E> remove(Node<E> n, Object obj) {
  if (n.data.equals(obj)) {
    return n.next;
  } else {
    n.next = remove(n.next, obj);
    return n;
  }
}

An equivalent iterative version reroutes the chain around the target node directly:

public boolean remove(Object obj) {
  if (!contains(obj)) {
    return false;
  }
  --size;
  int num = hash(obj);
  Node<E> n = bucket[num];
  if (n.data.equals(obj)) {
    bucket[num] = n.next;
  } else {
    while (! n.next.data.equals(obj)) {
      n = n.next;
    }
    n.next = n.next.next;   // reroute around
  }
  return true;
}
Test Program
Run the following demo program, which constructs the set with capacity 11 and load factor .9:

OpenHashSet<String> set = new OpenHashSet<>(11, .9F);
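The demo presumably continues by exercising the set; here is a sketch of such usage (the actual OpenHashSetDemo may differ; note that the .9F load factor keeps the nine sample strings from triggering a rehash, since 9/11 < .9):

String[] strings = {"music", "beer", "afterlife", "wisdom",
                    "politics", "theater", "schools", "painting", "fear"};
for (String s : strings) {
  set.add(s);
}
System.out.println(set.contains("beer"));   // true
System.out.println(set.add("beer"));        // false: already present
System.out.println(set.remove("beer"));     // true
System.out.println(set.contains("beer"));   // false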
LinkedHashSet
The purpose of the LinkedHashSet object, as we have seen, is to keep track of the entry order of elements while retaining the O(1) time of the lookup-based operations. The way to achieve this effect is to maintain, in addition to the hash table, a (doubly) linked list to hold the actual elements. The contains, add, and remove operations use the hash table and so remain O(1) (constant-time) operations. The major difference between HashSet and LinkedHashSet is how the iterator (and thereby print and for-loop accesses) works. With a HashSet, the iterator delivers the elements in no particular order, say, down the array and across the Node chains. In contrast, the LinkedHashSet uses the built-in linked list to iterate over the elements. As an example, we'll revisit the hash entries into a table of capacity 11:

element      index
-------      -----
"music"        2
"beer"         5
"afterlife"    9
"wisdom"       4
"politics"    10
"theater"      2
"schools"      1
"painting"     5
"fear"         3

Here is a depiction of the underlying structure:

[figure: hash table whose chain nodes point into a doubly-linked list holding the elements in entry order]
The supporting node classes now look like this:

private static class ListNode<E> {
  E data;
  ListNode<E> next;
  ListNode<E> prev;

  ListNode(E data, ListNode<E> next, ListNode<E> prev) {
    this.data = data;
    this.next = next;
    this.prev = prev;
  }
}

private static class Node<E> {
  ListNode<E> ptr;   // the hash chain now points into the linked list
  Node<E> next;

  Node(ListNode<E> ptr, Node<E> next) {
    this.ptr = ptr;
    this.next = next;
  }
}

private ListNode<E> first, last;
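Before looking at the operations, here is a minimal sketch of the addLast helper that the add operation below relies on; the method name comes from the prose, but the exact code is an assumption:

private ListNode<E> addLast(E elt) {
  ListNode<E> node = new ListNode<>(elt, null, last);   // next = null, prev = old last
  if (last == null) {
    first = node;        // list was empty
  } else {
    last.next = node;
  }
  last = node;
  return node;           // caller stores this pointer in the hash chain
}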

- contains(obj): while moving through the pointer chain in the hash table, the test for equality now becomes this:

  if (n.ptr.data.equals(obj))

- rehash: simply traverse the linked list, adding each node p into the newly created hash table according to the index hash(p.data).

- add(elt): after testing contains, do an addLast(elt) on the linked list, retrieving a pointer p to the added ListNode, then:

  bucket[index] = new Node<>(p, bucket[index]);

- remove(obj): after testing contains, obtain the ListNode pointer p for the obj in the linked list. Using p.next and p.prev we can reroute the linked list around p, thereby removing it. A useful helper function is this modification of contains:

  private Node<E> search(Object obj) {
    int index = hash(obj);
    Node<E> n = bucket[index];
    while (n != null) {
      if ( n.ptr.data.equals(obj) ) {
        break;
      } else {
        n = n.next;
      }
    }
    return n;   // null if obj is not present
  }
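Using search, a sketch of the LinkedHashSet-style remove might look like this (the unlinking follows the prose above; the helper name removeChain is an assumption):

public boolean remove(Object obj) {
  Node<E> n = search(obj);
  if (n == null) {
    return false;
  }
  ListNode<E> p = n.ptr;
  // Reroute the doubly-linked list around p.
  if (p.prev == null) { first = p.next; } else { p.prev.next = p.next; }
  if (p.next == null) { last = p.prev; }  else { p.next.prev = p.prev; }
  // Remove the entry from the hash chain, recursively as before.
  int index = hash(obj);
  bucket[index] = removeChain(bucket[index], obj);
  --size;
  return true;
}

private Node<E> removeChain(Node<E> n, Object obj) {
  if (n.ptr.data.equals(obj)) {
    return n.next;
  }
  n.next = removeChain(n.next, obj);
  return n;
}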