Tuesday, October 2, 2012

Hashtables are not so innocent afterall !

I have always thought, most of the problems which have seemingly a bad complexity are often solved very easily by hashing. Although, it is true for most the cases one should note that the time complexities offered by a hash depends on the quality of your hash. While the solution definitely reduces to an average case O(1), the worst case could still be O(n). How O(n) you ask ?
Lets talk about linear probing. This is one of techniques to resolve the conflicts. So, if hash of item A(while you are inserting it) comes out to be say 34, while item B already hashes to 34, then you put the item A at 35 if its free or continue in the same fashion o/w. This part (the linear probing) is also called as secondary search. Similarly, if you are searching and not inserting for item A then you probe linearly until you find an empty slot. An empty slot reached while probing linearly ensures that A is not in the table because of the technique with which we inserted item A. Sometimes, the value for the corresponding will actually be a list of items so that after hashing to a key one would iterate over the linked list of the items to finish the search. One could also employ some ideas from caching mechanism in maintaining those linked list. For example using a "most recently used first" approach would both be easy and effective. In both cases, a hashtable can a poor worst case time complexity if the hash function is not good.
Now, designing a good hash function has its own costs associated with it. A robust hash which ensures good level of distribution of data may be costly in itself. I had seen some MIT paper with a very good hash function. It involved some bit twiddling hacks to ensure it ran super fast.
EDIT: Here is the code

unsigned int
bitwisehash(char *word, int tsize, unsigned int seed)
{
    char c;
    unsigned int h;

    h = seed;
    for( ; ( c=*word )!='\0' ; word++ )
    {
 h ^= ( (h << 5) + c + (h >> 2) );
    }
    return((unsigned int)((h&0x7fffffff) % tsize));
}


One more problem arises if the table gets full. If it is implemented as an array we would have to allocate another array of double the size then rehash all the items. Too bad, you will agree. Although, for the chained type of implementation this issue wont arise.

EDIT: Just a random note for those working in Java. Know your DS. Leaarn the difference between Hashtables ane Hashmaps. Hashmaps may give very high performance gain over hashtables if concurrency is not an issue.

No comments:

Post a Comment