Hash Tables and Hash Functions
(ハッシュ表とハッシュ関数)
Data Structures and Algorithms
10th lecture, December 1, 2022
https://www.sw.it.aoyama.ac.jp/2022/DA/lecture10.html
Martin J. Dürst

© 2009-22 Martin J. Dürst, Aoyama Gakuin University (青山学院大学)
Today's Schedule
- Leftovers and summary of last lecture
- Additional speedup for dictionaries
- Overview of hashing
- Hash functions
- Conflict resolution
- Evaluation of hashing
- Hashes in Ruby
- Summary
Summary of Last Lecture
- Balanced trees keep search/insertion/deletion in a dictionary ADT at O(log n) worst-case time
- 2-3-4 trees, B-trees, and B+ trees increase the degree of the nodes beyond that of a binary tree, and keep all leaves at the same depth
- Red-black trees and AVL trees limit the variation of the tree height
- B+ trees are very useful for file systems and databases on secondary storage
Time Complexity for Known Dictionary Implementations
| Implementation | Search | Insertion | Deletion |
| Sorted array | O(log n) | O(n) | O(n) |
| Unordered array/linked list | O(n) | O(1) | O(n) |
| Balanced tree | O(log n) | O(log n) | O(log n) |
Can we do better?
Direct Addressing
- Use an array with an element for each key value
- Search: value = array[key], time complexity: O(1)
- Insertion/replacement: array[key] = value, time complexity: O(1)
- Deletion: array[key] = nil, time complexity: O(1)
- Example:
students = []
students[15820000] = "I.T. Aoyama"
Problem: Array size, non-numeric keys
Solution: Transform key with a hash function
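The array-size problem can be seen directly in Ruby. The following is a minimal sketch that reuses the student number from the example above; it only illustrates the point and is not part of the lecture's code.

```ruby
# Direct addressing: the key itself is the array index.
students = []
students[15820000] = "I.T. Aoyama"

puts students[15820000]   # search in O(1)
puts students.size        # → 15820001 slots allocated for a single entry

students[15820000] = nil  # deletion in O(1); the slots stay allocated
```

A non-numeric key such as "Hanako Aoyama" cannot be used as an array index at all, which is the second problem named above.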
Overview of Hashing
(also called scatter storage technique)
- Transform the key k to a compact space using the hash function hf
- Use hf(k) instead of k for locating the data in the array
- The data is contained in the hash table (below: table)
- hf(k) can be evaluated in constant time (O(1))
- Search: value = table[hf(key)], time complexity: O(1)
- Insertion/replacement: table[hf(key)] = value, time complexity: O(1)
- Deletion: table[hf(key)] = nil, time complexity: O(1)
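The three operations above might be wrapped into a small class as follows. This is a deliberately naive sketch: the class name NaiveHashTable is hypothetical, keys are assumed to be Integers, and two keys that hash to the same index simply share a slot.

```ruby
# A deliberately naive hash table: no conflict handling at all.
class NaiveHashTable
  def initialize(size = 100)
    @table = Array.new(size)
  end

  # Hash function: maps an Integer key into 0...size.
  def hf(key)
    key % @table.size
  end

  def [](key)            # search
    @table[hf(key)]
  end

  def []=(key, value)    # insertion/replacement
    @table[hf(key)] = value
  end

  def delete(key)        # deletion
    @table[hf(key)] = nil
  end
end

students = NaiveHashTable.new(100)
students[15820042] = "I.T. Aoyama"
puts students[15820042]   # → I.T. Aoyama
students.delete(15820042)
p students[15820042]      # → nil
```

The table needs only 100 slots regardless of how large the keys are.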
Problems with Hashing
- Choice/design of hash function
Example 1: remainder:
def hf(k); k % 100; end
students[15820000 % 100] = "I.T.Aoyama"
Example 2: sum of codepoints (character numbers):
def hf(k); k.codepoints.sum; end
students["Hanako Aoyama".codepoints.sum] = ...
- Resolution of conflicts
What happens with the following:
students[15821000 % 1000] = "I.T.Aoyama"
students[15721000 % 1000] = "K.S.Aoyama"
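Running the two assignments shows what happens. The short sketch below also checks the codepoint-sum function from Example 2; the helper name codepoint_sum and the two test words are chosen only for illustration.

```ruby
students = []
students[15821000 % 1000] = "I.T.Aoyama"
students[15721000 % 1000] = "K.S.Aoyama"   # same index: overwrites the first entry

p 15821000 % 1000   # → 0
p 15721000 % 1000   # → 0
p students[0]       # → "K.S.Aoyama"

# The codepoint sum ignores character order, so anagrams always collide.
def codepoint_sum(k)
  k.codepoints.sum
end

p codepoint_sum("listen") == codepoint_sum("silent")   # → true
```

Without conflict resolution, the first student is silently lost.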
Overview of Hash Function
- Goals:
  1. From a key, calculate an index that is evenly (randomly) distributed
     (this is the reason for the word hash, as in hashed beef or hash browns)
  2. Adjust the range of the result to the size of the hash table
- Steps:
  1. Calculate a large integer (e.g. int in C) from the key
  2. Adjust this large integer to the hash table size (using a modulo operation or Fibonacci hashing)
Goal/step 2 is easy. Therefore, we concentrate on goal/step 1.
(often step 1 alone is called 'hash function')
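The two steps can be sketched in Ruby as follows. The function large_integer is only a stand-in for step 1 (a simple polynomial hash, not a function from the lecture). For step 2, fibonacci_index assumes a table size of 2^bits and uses the common multiplier 2654435769, which is 2^32 divided by the golden ratio.

```ruby
# Step 1 (stand-in for illustration): a 32-bit integer from a string key.
def large_integer(key)
  key.codepoints.reduce(0) { |h, c| (h * 31 + c) & 0xFFFFFFFF }
end

# Step 2, variant A: modulo operation for a table with m bins.
def modulo_index(h, m)
  h % m
end

# Step 2, variant B: Fibonacci hashing for a table with 2**bits bins.
# Multiply, keep the low 32 bits, then take the topmost `bits` bits.
def fibonacci_index(h, bits)
  ((h * 2654435769) & 0xFFFFFFFF) >> (32 - bits)
end

h = large_integer("Hanako Aoyama")
p modulo_index(h, 100)      # some index in 0...100
p fibonacci_index(h, 4)     # some index in 0...16
p fibonacci_index(1, 4)     # → 9
```

The modulo variant works for any table size, while the Fibonacci variant needs a power of two but avoids a division.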
Hash Function Example 1
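unsigned int sdbm_hash(const char key[])
{
    unsigned int hash = 0;  /* unsigned: overflow wraps around instead of being undefined */
    while (*key) {
        /* parentheses are required: + and - bind tighter than << in C */
        hash = (unsigned char)*key++ + (hash << 6)
             + (hash << 16) - hash;
    }
    return hash;
}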
Hash Function Example 2
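(simplified from MurmurHash3; for 32-bit machines)
#include <stdint.h>
#define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))
#define C1 0xcc9e2d51u  /* multiplier used in MurmurHash3_x86_32 */
#define R1 15           /* rotation used in MurmurHash3_x86_32 */
uint32_t too_simple_hash(const uint32_t key[], int length)
{
    uint32_t h = 0;  /* unsigned, so shifts and overflow are well defined */
    for (int i=0; i<length; i++) {
        uint32_t k = key[i] * C1;
        h ^= ROTL32(k, R1);
    }
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    return h;
}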
Frequent operations in hash functions: Addition (+), multiplication (*), bitwise XOR (^), shift (<<, >>)
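For comparison, the sdbm hash of Example 1 could be ported to Ruby roughly as follows. Ruby Integers never overflow, so this sketch masks the result to 32 bits explicitly; the mask width is an assumption that matches a 32-bit unsigned int in C.

```ruby
# Ruby port of the sdbm string hash, kept within 32 bits.
def sdbm_hash(key)
  hash = 0
  key.each_byte do |byte|
    # addition, shifts and subtraction, then wrap around at 2**32
    hash = (byte + (hash << 6) + (hash << 16) - hash) & 0xFFFFFFFF
  end
  hash
end

p sdbm_hash("")     # → 0
p sdbm_hash("a")    # → 97
p sdbm_hash("ab")   # → 6363201
```

The result still has to be adjusted to the table size (step 2) before it can be used as an index.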
Evaluation of Hash Functions
- Quality of distribution
- Execution speed
- Ease of implementation
Precautions for Hash Functions
- Use all parts of the key
  Counterexample: Using only characters 3 and 4 of a string → bad distribution
- Do not use data besides the key
  If the hash function used other data attributes (e.g. price of a product, student's marks) and they changed, the hash function result would change too, and the data could no longer be found.
- Collapse equivalent keys
  Example 1: Strings: Upper/lower case letters
  Example 2: Corners in the game of Go: top/bottom, left/right, diagonal, and black/white symmetries
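Collapsing equivalent keys (Example 1) might look like this in Ruby. The wrapper class CaseInsensitiveKey is hypothetical; it relies on the fact that Ruby's Hash calls the methods hash and eql? on its keys.

```ruby
# Wrapper that makes "Aoyama", "AOYAMA" and "aoyama" the same dictionary key.
class CaseInsensitiveKey
  attr_reader :normalized

  def initialize(text)
    @normalized = text.downcase   # collapse equivalent keys before hashing
  end

  def hash                        # used by Hash to choose the bin
    @normalized.hash
  end

  def eql?(other)                 # used by Hash to confirm the key
    other.is_a?(CaseInsensitiveKey) && @normalized == other.normalized
  end
  alias_method :==, :eql?
end

students = {}
students[CaseInsensitiveKey.new("Hanako Aoyama")] = 15820000
p students[CaseInsensitiveKey.new("HANAKO AOYAMA")]   # → 15820000
p students.size                                        # → 1
```

Only the normalized key takes part in hashing and comparison, so no other data can influence the result.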
Conflicts
- A conflict happens when hf(k1) = hf(k2) but k1 ≠ k2
- Even with good hash functions, conflicts happen quite easily
- This requires special treatment
- Main solutions: chaining and open addressing
Terms and Variables for Conflict Resolution
- Number of data items: n
- Fields in hash table: bins/buckets
- Number of bins: m
(equal to the range of values of the hash function after the modulo
operation)
- Load factor (average number of data items per bin): α (= n/m)
- For a good (close to random) hash function, the variation in the number of data items for each bin is low (the counts follow a Poisson distribution)
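To get a feeling for this distribution, the Poisson probabilities can be tabulated. The sketch below uses the standard formula P(k) = e^(-α) α^k / k! for the probability that a bin holds exactly k items; the load factor 3.0 is just an example value.

```ruby
# Probability that a bin holds exactly k items at load factor alpha
# (Poisson approximation, reasonable when the number of bins is large).
def poisson(alpha, k)
  factorial = (1..k).reduce(1, :*)
  Math.exp(-alpha) * alpha**k / factorial
end

alpha = 3.0
(0..8).each do |k|
  printf("%d items: %.3f\n", k, poisson(alpha, k))
end
```

Bins with many more items than α are rare, which is why a good hash function keeps the lists in each bin short.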
Chaining
- Store conflicting data items in a linked list
- Each bin in the hash table starts a linked list
- If the linked list is short, then search/insertion/deletion will be fast
- The average length of the linked list is equal to the load factor α
- The load factor is usually greater than 1 (e.g. 3 ≤ α ≤ 6)
- All operations are carried out in three steps:
  1. Use hf(k) mod m to find the bin
  2. Use hf(k) (without modulo operation) to find a candidate entry in the linked list
  3. Use the actual key k to confirm that we found the correct data item in the linked list
Implementation of Chaining
- Implementation in Ruby: Ahashdictionary.rb
- Uses Array in place of linked list
- Uses Ruby's hash function
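A chaining dictionary along these lines might look as follows. This is an independent minimal sketch, not the contents of Ahashdictionary.rb; the class name ChainedHashTable and its Entry struct are hypothetical. Like the bullets above, it uses an Array per bin and Ruby's own hash method, and it follows the three lookup steps (bin, stored hash value, actual key).

```ruby
class ChainedHashTable
  Entry = Struct.new(:hash_value, :key, :value)

  attr_reader :size

  def initialize(bin_count = 8)
    @bins = Array.new(bin_count) { [] }   # one chain (Array) per bin
    @size = 0
  end

  def [](key)
    entry = find_entry(key)
    entry && entry.value
  end

  def []=(key, value)
    entry = find_entry(key)
    if entry
      entry.value = value                 # replacement
    else
      chain_for(key.hash) << Entry.new(key.hash, key, value)
      @size += 1
    end
  end

  def delete(key)
    entry = find_entry(key)
    return nil unless entry
    chain_for(entry.hash_value).delete_if { |e| e.equal?(entry) }
    @size -= 1
    entry.value
  end

  private

  # Step 1: hf(k) mod m selects the bin.
  def chain_for(h)
    @bins[h % @bins.size]
  end

  # Step 2: compare the stored hash value; step 3: confirm with the key.
  def find_entry(key)
    h = key.hash
    chain_for(h).find { |e| e.hash_value == h && e.key.eql?(key) }
  end
end

students = ChainedHashTable.new(4)
students["I.T. Aoyama"] = 15821000
students["K.S. Aoyama"] = 15721000
p students["K.S. Aoyama"]   # → 15721000
p students.size             # → 2
```

Comparing the stored hash value first is cheap and avoids most full key comparisons, which can be slow for long keys.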
Open Addressing
- Store key and data in hash table itself
- In case of conflict, successively check different bins
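- For check number i (i = 0, 1, 2, ...), use hash function ohf(key, i)
  - Linear probing: ohf(key, i) = hf(key) + i
  - Quadratic probing: ohf(key, i) = hf(key) + c₁ i + c₂ i²
  - Many other variations exist
  (in each case, the result is taken modulo the number of bins m)
- The load factor has to be between 0 and 1; ≤ 0.5 is reasonable
- Problem: Deletion is difficult (simply emptying a bin would cut off the probe sequence for items stored after it)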
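Linear probing can be sketched as below. The class LinearProbingTable is hypothetical and uses one common workaround for the deletion problem, namely marking deleted bins with a special DELETED value so that later probes continue past them. It uses Ruby's hash method for hf and never expands the table.

```ruby
class LinearProbingTable
  EMPTY   = Object.new   # bin was never used
  DELETED = Object.new   # bin was used, then deleted

  def initialize(bin_count = 16)
    @keys = Array.new(bin_count, EMPTY)
    @values = Array.new(bin_count)
  end

  def [](key)
    i = find_index(key)
    i && @values[i]
  end

  def []=(key, value)
    i = find_index(key) || free_index(key)
    raise "table is full" unless i
    @keys[i] = key
    @values[i] = value
  end

  def delete(key)
    i = find_index(key)
    return nil unless i
    old = @values[i]
    @keys[i] = DELETED    # keep the probe sequence intact
    @values[i] = nil
    old
  end

  private

  # ohf(key, i) = (hf(key) + i) mod m, for i = 0, 1, 2, ...
  def probe_sequence(key)
    m = @keys.size
    start = key.hash % m
    (0...m).map { |i| (start + i) % m }
  end

  def find_index(key)
    probe_sequence(key).each do |i|
      k = @keys[i]
      return nil if k.equal?(EMPTY)              # key cannot be further along
      return i if !k.equal?(DELETED) && k.eql?(key)
    end
    nil
  end

  def free_index(key)
    probe_sequence(key).find { |i| @keys[i].equal?(EMPTY) || @keys[i].equal?(DELETED) }
  end
end

table = LinearProbingTable.new(8)   # 3 items in 8 bins: load factor below 0.5
table["January"] = 31
table["February"] = 28
table["March"] = 31
table.delete("February")
p table["February"]   # → nil
p table["March"]      # → 31
```

Deleted markers accumulate over time, so a practical implementation would clean them up when the table is rebuilt.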
Time Complexity of Hashing
(average, for chaining)
- Calculation of hash function
  - Dependent on key length
  - O(1) if key length is constant or limited
- Search in bin
  - Dependent on load factor
  - O(1) if load factor is below a given constant
  - O(n) in worst case, but this can be avoided by choice of hash function
Expansion and Shrinking of Hash Table
- The efficiency of hashing depends on the load factor
- If the number of data items increases, the hash table has to be expanded
- If the number of data items decreases, it is desirable to shrink the hash table
- Expansion/shrinking can be implemented by re-inserting the data into a new hash table (changing the divisor of the modulo operation)
- Expansion/shrinking is expensive (time: O(n))
Analysis of the Time Complexity of Expansion
- If the hash table is expanded for every data insertion, this is very
inefficient
- Limit the number of expansions:
  - Expand the hash table only when the number of data items has doubled
  - The time needed for the insertion of n data items (n = 2^x) is 2 + 4 + 8 + ... + n/2 + n < 2n = O(n)
  - The time complexity per data item is O(n)/n = O(1)
- This is a simple example of amortized analysis.
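The bound can be checked with a small simulation. The function reinsertion_cost below is a hypothetical helper that only counts how many items are re-inserted when the capacity doubles; it does not store any data, and its exact count differs slightly from the sum above while staying below 2n.

```ruby
# Counts the re-insertions caused by expansions while inserting n items,
# doubling the capacity each time the table is full.
def reinsertion_cost(n)
  capacity = 1
  items = 0
  moved = 0
  n.times do
    if items == capacity
      capacity *= 2
      moved += items   # every stored item is re-inserted into the new table
    end
    items += 1
  end
  moved
end

n = 1024
p reinsertion_cost(n)             # → 1023, which is below 2n
p reinsertion_cost(n).to_f / n    # less than 1 re-insertion per item on average
```

Expanding by a fixed number of bins instead of doubling would make the total cost grow quadratically.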
Special Purpose Hash Functions
- Universal hashing
- Perfect hash function
- Cryptographic hash function
Universal Hashing
- Denial-of-service attack (reference):
  - Attacker provides lots of data with the same hash value
  - Efficiency of hashing degrades from O(1) to O(n)
- Solution: Use a random number to create a different hash function for each program execution
- Example: ruby -e 'puts 123.hash, 123.hash' will produce different results on different invocations, but the same result twice within one invocation
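One way to build such a randomized hash function is the family h(k) = ((a·k + b) mod p) mod m with random a and b. The sketch below is an illustration under stated assumptions, not how Ruby itself seeds its hash method: the class name SeededHash is hypothetical, and keys are assumed to be Integers smaller than the prime p.

```ruby
# Hash function family h(k) = ((a*k + b) mod p) mod m with random a and b.
class SeededHash
  PRIME = 2_147_483_647   # 2**31 - 1, a prime larger than the keys used here

  def initialize(bin_count, random = Random.new)
    @m = bin_count
    @a = random.rand(1...PRIME)   # chosen once per program run
    @b = random.rand(0...PRIME)
  end

  def call(key)   # assumption: key is an Integer in 0...PRIME
    ((@a * key + @b) % PRIME) % @m
  end
end

hf = SeededHash.new(16)
p hf.call(123) == hf.call(123)   # → true: stable within one run
p hf.call(123)                   # may differ from run to run
```

Because a and b are unknown to an attacker, a set of keys prepared in advance is unlikely to land in a single bin.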
Perfect Hash Function
- Custom-designed hash function without conflicts
- Useful when data is completely predefined
- In the best case, the hash table is completely filled
- Application: Keywords in programming languages
- Example implementation: GNU gperf (used in Ruby character property lookups)
Cryptographic Hash Function
- Used for electronic signatures, ...
- Differences from general hash functions:
  - Output usually longer (e.g. 128/256/384/512/... bits)
  - Evaluation may take longer
  - One-way function, essentially impossible to invert (find k from hf(k))
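Ruby's standard library provides cryptographic hash functions in the digest library. The short sketch below uses SHA-256 as one example; the input strings are made up for illustration.

```ruby
require 'digest'

digest = Digest::SHA256.hexdigest("I.T. Aoyama")
puts digest
puts digest.length * 4    # → 256 (bits; each hex digit encodes 4 bits)

# A one-character change gives a completely different digest.
puts digest == Digest::SHA256.hexdigest("I.T. Aoyamb")   # → false
```

Such an output is far too long to index a table directly, and the function is slower than the hash functions shown earlier, which is why it is not used for ordinary hash tables.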
Evaluation of Hashing
Advantages:
- Search/insertion/deletion are possible in (average) constant time
- Reasonably good actual performance
- No need for keys to be numeric or ordered
- Wide field of application
Problems:
- Sorting needs to be done separately
  (Ruby Hashes store insertion order, but not key order)
- Proximity/similarity search is impossible
- Expansion/shrinking requires time (an individual operation may stall while the table is rebuilt)
Comparison of Dictionary Implementations
| Implementation | Search | Insertion | Deletion | Sorting |
| Sorted array | O(log n) | O(n) | O(n) | O(n) |
| Unordered array/linked list | O(n) | O(1) | O(n) | O(n log n) |
| Balanced tree | O(log n) | O(log n) | O(log n) | O(n) |
| Hash table | O(1) | O(1) | O(1) | O(n log n) |
The Ruby Hash Class
(Perl: hash; Java: HashMap; Python: dict; JavaScript, JSON: object)
- Because dictionary ADTs are often implemented using hashing, in many programming languages, dictionaries are also called "hash"
- Creation: my_hash = {} or my_hash = Hash.new
- Initialization: months = {'January' => 31, 'February' => 28, 'March' => 31, ... }
- Insertion/replacement: months['February'] = 29
- Lookup: this_month_length = months[this_month]
Hash in Ruby has more functionality than in other programming languages (presentation)
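The operations listed above can be tried out as follows. This small sketch uses only the first three months plus one extra entry, and also shows the difference between insertion order and key order.

```ruby
months = { 'January' => 31, 'February' => 28, 'March' => 31 }
months['February'] = 29          # replacement
months['April'] = 30             # insertion

p months['March']                # → 31
p months['May']                  # → nil (key not present)
p months.keys                    # → ["January", "February", "March", "April"]
p months.sort.map(&:first)       # → ["April", "February", "January", "March"]

months.delete('April')           # deletion
p months.size                    # → 3
```

Key order requires an explicit sort, which takes O(n log n) time.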
Implementation of Hashing in Ruby
- Source: st.c
- Used chaining until 2016 (originally by Peter Moore, University of California Berkeley (1989))
- In Nov. 2016 replaced by open addressing (by Vladimir Makarov, with help from Yura Sokolov)
- Reason: Faster because open addressing works better with modern cache hierarchies
- Used inside Ruby, too:
- Lookup of global identifiers such as class names
- Lookup of methods for each class
- Lookup of instance variables for each object
Summary
- A hash table implements a dictionary ADT using a hash function
- Main design points:
  - Selection of hash function
  - Conflict resolution methods (chaining or open addressing)
- Reasonably good actual performance
- Wide field of application
Glossary
- direct addressing: 直接アドレス表
- hashing, scatter storage technique: ハッシュ法、挽き混ぜ法
- hash function: ハッシュ関数
- hash table: ハッシュ表
- game of Go: 囲碁
- joseki: 定石 (囲碁)
- conflict: 衝突
- Poisson distribution: ポアソン分布
- chaining: チェイン法、連鎖法
- open addressing: 開番地法、オープン法
- load factor: 占有率
- linear probing: 線形探査法
- quadratic probing: 二次関数探査法
- divisor: (割り算の) 法
- amortized analysis: 償却分析
- universal hashing: 万能ハッシュ法
- perfect hash function: 完全ハッシュ関数
- denial of service attack: DOS 攻撃、サービス拒否攻撃
- cryptographic hash function: 暗号技術的ハッシュ関数
- one-way function: 一方向性関数
- electronic signature: 電子署名
- proximity search: 近接探索
- similarity search: 類似探索