Hash Tables

From

Revision as of 16:00, 19 April 2009 by Admin (Talk | contribs)
Jump to: navigation, search
← Maps and Dictionaries ↑ Contents: CS2 End


Hashing is called a Key to Address system. Objects (records) are stored in a table, and each object stored occupies a slot in the table. The location of the slot is called the address of the object. The address of an object is obtained using a function called a hash function. Each object has an associated key. The hash function takes the key as a parameter and returns the address of the object.

The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that the hash table uses to locate the desired value.

Contents

Hash Table Behavior

A good hash function should distribute the objects throughout the table with equal probability. A good hash function should minimize collisions. In addition, a good hash function should be easy to calculate.

For example, a simple hash function for a table of size M is:

FUNCTION hash( key )
    RETURN (key % M);
ENDFUNCTION


However, this function works well only if the keys have an equal distribution.

Other examples of hash functions: ELF (Executable and Linking Format [Unix System V]), and Soundex

Collision Resolution

There are two basic techniques of collision resolution: Open Hashing (also called Separate Chaining) and Closed Hashing (also called Open Addressing).

Open Hashing

Open Hashing means that collisions are stored outside the table. It may be an overflow area or it may be a series of linked lists:
...IMAGE HERE...
Another version of open hashing is called Bucket Hashing. In bucket hashing, the slots of the table are organized into groups called buckets. The hash function returns a bucket number, and the item is inserted into an open slot in the bucket. If the bucket is full, there is an overflow area.

Closed Hashing

In closed hashing, collisions are stored in the table itself. This can create a condition known as a secondary collision.

One type of collision resolution is know as probing. In probing, the hash function becomes:

h( k ) + p( i ), where i is an iteration value and p( 0 ) = 0. The simplest of these is linear probing, where p( i ) = i. This, however, can result in a problem called clustering, which can increase the probability of secondary collisions.


Suppose the table has ten slots and the hash function gives an equal probability to any slot in the table. Thus the probability is 0.1 that a key hashes to any one slot. Now suppose that slots 0 through 3 are filled. Then the probability that the next value will end up in slot 4 is 0.5, because anything that hashes to slots 0, 1, 2, 3, or 4 will end up in slot 4.

In Quadratic Probing, the probe function is p( i ) = i2. With quadratic probing, if the size of the table is a prime and the table is less than one-half full, an new element can always be inserted. However, if the table is more than half full, the insertion might fail.

A problem with probing is that the same sequence is used for all keys. So, as collisions occur, secondary collisions increase. An alternative to this is double hashing. Double hashing is defined by the function: h1( k ) + i * h2( k ). Note that it is important that h2( k ) not produce a 0. A good choice for a double hashing function is the let the table size be a prime number M and to let R be another prime that is less than M, then h2( k ) = R – ( k % R ).

Load Factor

The load factor α is defined to be N/M, where N is the number of records in the table and M is the size of the table. For open hashing, we generally want the load factor to be close to 1. For closed hashing, we generally want to load factor to be less that 0.5.

Deletions

When deleting records from a hash table, there are two important considerations.

  1. Deleting a record must not hinder later searches. That is, it must not cut off a chain used

for probing.

  1. A slot freed because of a deletion must remain usable.


One solution to these problems is to use a tombstone. A tombstone is a special marker that states that a slot is free; however, it used to be part of a chain. If a search encounters a tombstone, it should continue going. When inserting, however, to avoid duplicate values in the table, a search must still traverse the entire chain before reusing the tombstoned slot.

Tombstones do lengthen the size of chains. One possibility is that after a deletion, continue to follow a chain, swapping the tombstone with the value along the chain. In this manner, tombstones are always moved to the end of chains and are eventually reused.

ReHashing

When a table gets too full or when chains get too long due to tombstones, create another table at least twice the size of the original table and process each record, using a new hash function, and insert the records into the new table.


CS2: Data Structures
Theory of Computation - ADT Preliminaries
Linear ADTs - Tree ADTs - Graph ADTs - Unordered Collection ADTs


Personal tools
MediaWiki Appliance - Powered by TurnKey Linux