Thursday, 14 March 2013

Java Hashtable & Hashing

Leave a Comment

Hashtable is an implementation of a key-value pair data structure in java. You can store and retrieve a ‘value’ using a ‘key’ and it is an identifier of the value stored. It is obvious that the ‘key’ should be unique.

java.util.Hashtable extends Dictionary and implements Map. Objects with non-null value can be used as a key or value. Key of the Hashtable must implement hashcode() and equals() methods. By the end of this article you will find out the reason behind this 
condition.Hashtable
Generally a Hashtable in java is created using the empty constructor Hashtable(). Which is a poor decision and an often repeated mistake. Hashtable has two other constructors
Hashtable(int initialCapacity)
and
Hashtable(int initialCapacity, float loadFactor)
. Initial capacity is number of buckets created at the time of Hashtable instantiation. Bucket is a logical space of  storage for Hashtable.

Hashing and Hashtable

Before seeing java’s Hashtable in detail you should understand hashing in general. Assume that, v is a value to be stored and k is the key used for storage / retrieval, then h is a hash function where v is stored at h(k) of table. To retrieve a value compute h(k) so that you can directly get the position of v. So in a key-value pair table, you need not sequentially scan through the keys to identify a value.

h(k) is the hashing function and it is used to find the location to store the corresponding value v. h(k) cannot compute to a indefinite space. Storage allocated for a Hashtable is limited within a program. So, the hasing function h(k) should return a number within that allocated spectrum.

Hashing in Java

Java’s hashing uses uses hashCode() method from the key and value objects to compute. Following is the core code from Hashtable where the hashCode ‘h’ is computed. You can see that both key’s and value’s hashCode() method is called.
h += e.key.hashCode() ^ e.value.hashCode();
It is better to have your hashCode() method in your custom objects. String has its own hashCode methode and it computes the hashcode value as below:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
If you don’t have a hashCode() method, then it is derived from Object class. Following is Javadoc Comment of hashCode() method from Object class:
* Returns a hash code value for the object. This method is
   * supported for the benefit of hashtables such as those provided by
   * <code>java.util.Hashtable</code>.
If you are going to write a custom hashCode(), then follow the following contract:
* The general contract of <code>hashCode</code> is:
    * <ul>
    * <li>Whenever it is invoked on the same object more than once during
    *     an execution of a Java application, the <tt>hashCode</tt> method
    *     must consistently return the same integer, provided no information
    *     used in <tt>equals</tt> comparisons on the object is modified.
The following is to improve performance of the Hashtable.
* <li>If two objects are equal according to the <tt>equals(Object)</tt>
   *     method, then calling the <code>hashCode</code> method on each of
   *     the two objects must produce the same integer result.
hashCode() guarantees distinct integers by using the internal address of the object.

Collision in Hashtable

When we try to restrict the hashing function’s output within the allocated address spectrue limit, there is a possibility of a collision. For two different keys k1 and k2, if we have h(k1) = h(k2), then this is called collision in hashtable. What does this mean, our hashing function directs us store two different values (keys are also different) in the same location.

When we have a collision, there are multiple methodologies available to resolve it. To name a few hashtable collision resolution technique, ‘separate chaining’, ‘open addressing’, ‘robin hood hashing’, ‘cuckoo hashing’, etc. Java’s hashtable uses ‘separate chaining’ for collision resolution in Hashtable.

Collision Resolution in java’s Hashtable

Java uses separate chaining for collision resolution. Recall a point that Hashtable stores elements in buckets. In separate chaining, every bucket will store a reference to a linked list. Now assume that you have stored an element in bucket 1. That means, in bucket 1 you will have a reference to a linked list and in that linked list you will have two cells. In those two cells you will have key and its corresponding value.

Hashtable Collision
Why do you want to store the key? Because when there is a collision i.e., when two keys results in same hashcode and directs to the same bucket (assume bucket 1) you want to store the second element also in the same bucket. You add this second element to the already created linked list as the adjacent element.

Now when you retrieve a value it will compute the hash code and direct you to a bucket which has two elements. You scan those two elements alone sequentially and compare the keys using their equals() method. When the key mathches you get the respective value. Hope you have got the reason behind the condition that your object must have hashCode() and equals() method.

Java has a private static class Entry inside Hashtable. It is an implementation of a list and you can see there, it stores both the key and value.

Hashtable performance

To get better performance from your java Hashtable, you need to
1) use the initialCapacity and loadFactor arguments
2) use them wisely

while instantiating a Hashtable.

initialCapacitiy is the number of buckets to be created at the time of Hashtable instantiation. The number of buckets and probability of collision is inversly proportional. If you have more number of buckets than needed then you have lesser possibility for a collision.

For example, if you are going to store 10 elements and if you are going to have initialCapacity as 100 then you will have 100 buckets. You are going to calculate hashCoe() only 10 times with a spectrum of 100 buckets. The possibility of a collision is very very less.

But if you are going to supply initialCapacity for the Hashtable as 10, then the possibility of collision is very large. loadFactor decides when to automatically increase the size of the Hashtable. The default size of initialCapacity is 11 and loadFactor is .75 That if the Hashtable is 3/4 th full then the size of the Hashtable is increased.
New capacity in java Hashtable is calculated as follows:
int newCapacity = oldCapacity * 2 + 1;
If you give a lesser capacity and loadfactor and often it does the rehash() which will cause you performance issues. Therefore for efficient performance for Hashtable in java, give initialCapacity as 25% extra than you need and loadFactor as 0.75 when you instantiate.

0 comments:

Post a Comment