in binary sort the algorithm complexcity is O(logn).
can we do better ?
is there an algorithm that it has O(1) time complexcity for searching?
yes there is.
that is hashing.
a hash table is a collection of items which can store in such a way as to make it easy to find them later.
each position of the hash table, often called a slot , can hold an item and is named by an integer value starting at 0.
the mapping between an item and the slot where that item belongs in the hash table is called the hash function
the hash function will take any item in the collection and return an integer in the range of slot names, between 0 and n-1.
using the “remainder method” as the hash function for example, using hash(item) = item % 11 as the hash function. then, there has hash(54) = 10, hash(26) = 4, hash(93) = 5…
once the hash value is computed, we can insert each item
into the hash table at the designed position.
note that, if a hash table can carry m items at most
and then the hash table is containing k items now.
there a parameter lambda is introduced, “load factor”,
which equals to k/m.
when we want to search for an item, we only need to compute the hash value of the item and use it as
the slot name to refer to the item. so, the entire time complexity is O(1).
however what will happen if two items have the same hash value?
for instance, hash(44) = 44%11 = 0, and hash(77) = 77%11 = 0 ,44 and 77 hash the same hash value now…
that fact is called collision. it happens when two or more items those have the same hash value.
Creating a hash table, it is necessary to solve the collisions.
how perfect it is if a hash function maps each item to distinct slot! such a hash function is called perfect hash function in general.
unfortunately, there is no systematic way to construct a perfect hash function, given an arbitrary collection of items.
luckily, we do not need the perfect hash function to still get performance efficiency enough.
one way to always have a better hash function is to increase the size of the hash table so that each possible value in the item range can be accommodated. this guarantees that each item will have a unique slot although it is practical for small numbers of items, it is not feasible when the number of possible items is for example, if the items were nine-digit social security numbers, this method would require almost one billion slots. if we want to store data for a class
of 25 students, we will be wasting an enormous amount of memory.
considering the goal is to minimize the collision times in a hash table, there are a number of methods to extend the simple remainder method.
the first one: folding method. divide the item into equal-size pieces.
add the pieces together to give the resulting hash value.
(436-555-4601)-> (43+65+55+46+01) = 210 hash(210) = 210%11 = 1
another method is mid-square method
44*44 = 1936
1936 % 11 = 5
collision resolution
when the collision happened, we must figure out a way to place the second item in the hash table
one method is look for another open slot to store
the second item. to implement this idea, a simple way is to start
at the original hash value position and then move
in a sequential manner through the slots until we
encounter the first empty slot.
when we arrive at the end of the slot, we should
return back to the start of the slot to check out the entire slots.
this resolution process is called open addressing
in that it tries to find the next open slot or
address in the hash table.
rehashing
the general name of the process of looking for another
slot after a collision is rehashing.
newhashvalue = rehash(oldhashvalue)
rehash(pos) = (pos+3)% sizeoftable
a variation of the linear probing idea is called quadratic probing.
instead of using a constant “skip” value, we use a refresh function that increments . the hash value by 1,3,5,7,9 and so on.this means if the first hash value is h,the successive value are h+1, h+4, h+9,
h+16,h+25…
an alternative method for handling the collision problem is to
allow each slot to hold a reference to a collect of items. Chaining allows many items to exit at the same location in the hash table.
when collisions happen, the item is still placed
in the proper slot of the hash table.
as more and more items hash to the same location, the difficulty of searching for the item in the collection increases.
class HashTable(object):
""" a hash table have 11 slots better to be prime number """
def __init__(self):
self.size = 11
self.slots = [None] * self.size
self.data = [None] * self.size
def hash(self, key):
""" key:int, value:orther return the hash value of key """
return key % self.size
def rehash(self, oldhashvalue):
""" using to deal with collision oldhashvalue: the value from hash(key) return: rehash value """
return (oldhashvalue + 1) % self.size
def put(self, key, value):
""" put the key in slot s[hash value] put the value in data[hash value] """
hsv = self.hash(key)
if self.slots[hsv] is None:
self.slots[hsv] = key
self.data[hsv] = value
else:
if key == self.slots[hsv]:
self.data[hsv] = value # replace
else:
rhsv = self.rehash(hsv)
while (self.slots[rhsv] is not None) and (hsv != rhsv):
rhsv = self.rehash(rhsv)
if self.slots[rhsv] is None:
self.slots[rhsv] = key
self.data[rhsv] = value
else:
self.data[rhsv] = value # replace
self.slots[rhsv] = key
def get(self, key):
""" key : int return the correspond value in self.data """
hsv = self.hash(key)
if self.slots[hsv] == key:
return self.data[hsv]
else:
rhsv = self.rehash(hsv)
while self.slots[rhsv] != key and rhsv != hsv:
rhsv = self.rehash(rhsv)
if self.slots[rhsv] == key:
return self.data[rhsv]
else:
return None
def __getitem__(self, key):
return self.get(key)
def __setitem__(self, key, value):
self.put(key, value)
if __name__ == "__main__":
H=HashTable()
H[54]='cat'
H[26]='dog'
H[93]='lion'
H[17]='tiger'
H[77]='bird'
H[31]='cow'
H[44]="goat"
H[55]='pig'
H[20]='chicken'
print(H.slots)
print(H.data)
print("add 44=goat")
H[44]="goat"
print(H.slots)
print(H.data)
print("add 88=hot")
H[88]='hot'
print(H.slots)
print(H.data)
print("add 99=bomb")
H[99]='bomb'
print(H.slots)
print(H.data)
print("add 66=fish")
H[66]='fish'
print(H.slots)
print(H.data)
[77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]
[‘bird’, ‘goat’, ‘pig’, ‘chicken’, ‘dog’, ‘lion’, ‘tiger’, None, None, ‘cow’, ‘c
at’]
add 44=goat
[77, 44, 55, 20, 26, 93, 17, 44, None, 31, 54]
[‘bird’, ‘goat’, ‘pig’, ‘chicken’, ‘dog’, ‘lion’, ‘tiger’, ‘goat’, None, ‘cow’,
‘cat’]
add 88=hot
[77, 44, 55, 20, 26, 93, 17, 44, 88, 31, 54]
[‘bird’, ‘goat’, ‘pig’, ‘chicken’, ‘dog’, ‘lion’, ‘tiger’, ‘goat’, ‘hot’, ‘cow’,
‘cat’]
add 99=bomb
[99, 44, 55, 20, 26, 93, 17, 44, 88, 31, 54]
[‘bomb’, ‘goat’, ‘pig’, ‘chicken’, ‘dog’, ‘lion’, ‘tiger’, ‘goat’, ‘hot’, ‘cow’,
‘cat’]
add 66=fish (replace)
[66, 44, 55, 20, 26, 93, 17, 44, 88, 31, 54]
[‘fish’, ‘goat’, ‘pig’, ‘chicken’, ‘dog’, ‘lion’, ‘tiger’, ‘goat’, ‘hot’, ‘cow’,
‘cat’]