Hash Functions and Hash Tables

(ハッシュ関数とハッシュ表)

Data Structures and Algorithms

10th lecture, December 1, 2016

http://www.sw.it.aoyama.ac.jp/2016/DA/lecture10.html

Martin J. Dürst

Today's Schedule

Summary of last lecture
Additional speedup for dictionary
Overview of hashing
Hash functions
Conflict resolution
Evaluation of hashing
Hashes in Ruby
Summary

Summary of Last Lecture

Balanced trees keep search/insertion/deletion in a dictionary ADT at O(log n) worst-case time
2-3-4 trees and B(+) trees increase the degree of the nodes compared to a binary tree, but keep all leaves at the same depth
Red-black trees and AVL trees impose limitations on the variation of the tree height
B-trees and B+ trees are very useful for file systems and databases on secondary storage

Time Complexity for Known Dictionary Implementations

Implementation	Search	Insertion	Deletion
Sorted array	O(log n)	O(n)	O(n)
Unordered array/linked list	O(n)	O(1)	O(n)
Balanced tree	O(log n)	O(log n)	O(log n)

Direct Addressing

Use an array with an element for each key value
Search: value = array[key], time: O(1)
Insertion/replacement: array[key] = value, time: O(1)
Deletion: array[key] = nil, time: O(1)
Example:
students = []
students[15815000] = "I.T. Aoyama"

Problem: Array size, non-numeric keys

Solution: Transform key with hash function

Overview of Hashing

(also called scatter storage technique)

Transform the key k to a compact space using the hash function hf
Use hf(k) instead of k in the same way as direct addressing
The array is called a hash table
The hash function is evaluated in O(1) time
Search: value = table[hf(key)], time: O(1)
Insertion/replacement: table[hf(key)] = value, time: O(1)
Deletion: table[hf(key)] = nil, time: O(1)

Problems with Hashing

Choice/design of hash function
Example 1: def hf(k); k % 100; end

students[15815000 % 100] = "I.T. Aoyama"

Example 2: def hf(k); k.codepoints.sum; end

students["Hanako Aoyama".codepoints.sum] = ...
Resolution of conflicts
What happens with the following:

students[15815000 % 100] = "I.T. Aoyama"

students[15715000 % 100] = "K.S. Aoyama"
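
The conflict above can be tried out with a minimal sketch. It assumes the `hf` from Example 1 and a plain Ruby array as the table, without any conflict resolution; the variable names are only for illustration.

```ruby
# Hash function from Example 1: keep only the last two decimal digits.
def hf(k)
  k % 100
end

students = []
students[hf(15815000)] = "I.T. Aoyama"
students[hf(15715000)] = "K.S. Aoyama"  # same bin: overwrites the first entry

first_bin  = hf(15815000)
second_bin = hf(15715000)
lookup     = students[hf(15815000)]     # no longer the value stored for this key
```

Both student numbers end in 00, so both keys land in the same bin and the first entry is lost.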

Overview of Hash Function

Goals:
1. From a key, calculate an index that is as evenly distributed as possible
  (this is the reason for the word hash, as in hashed beef or hash browns)
2. Adjust the range of the result to the hash table size
Steps:
1. Calculate a large integer (e.g. int in C) from the key
2. Adjust this large integer to the hash table size using a modulo operation

Step 2 is easy. Therefore, we concentrate on step 1.
(often step 1 alone is called 'hash function')
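
The two steps can be sketched in Ruby as follows. The method names, the multiplier 31, and the 32-bit mask are assumptions chosen for illustration, not part of the lecture.

```ruby
# Step 1 (illustrative choice): turn a string key into a large integer.
# The multiplier 31 and the 32-bit mask are assumptions for this sketch.
def large_integer(key)
  key.each_byte.reduce(0) { |h, b| (h * 31 + b) & 0xFFFFFFFF }
end

# Step 2: adjust the large integer to the table size m with a modulo operation.
def hf(key, m)
  large_integer(key) % m
end

index = hf("Hanako Aoyama", 101)   # always in the range 0...101
```

Keeping the two steps separate means that the table size can change later without touching step 1.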

Hash Function Example 1

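unsigned int sdbm_hash(const char key[])
{
    unsigned int hash = 0;  // unsigned: overflow wraps around instead of being undefined
    while (*key) {
        // parentheses are needed because + binds more tightly than <<
        hash = (unsigned char)*key++ + (hash << 6)
               + (hash << 16) - hash;
    }
    return hash;
}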

Hash Function Example 2

(simplified from MurmurHash3; for 32 bit machines)

#include <stdint.h>

#define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))
uint32_t too_simple_hash(const uint32_t key[], int length)
{
    uint32_t h = 0;  // unsigned, so that rotation and overflow are well defined
    for (int i=0; i<length; i++) {
        uint32_t k = key[i] * C1;  // C1 is a constant
        h ^= ROTL32(k, R1);  // R1 is a constant
    }
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    return h;
}

Evaluation of Hash Functions

Quality of distribution
Execution speed
Ease of implementation

Precautions for Hash Functions

Use all parts of the key
Counterexample: Using only characters 3 and 4 of a string → bad distribution
Do not use data besides the key
If some data attributes (e.g. price of a product, student's total marks) change, the hash value will change and the data will not be found anymore
Collapse equivalent keys
Examples: Upper/lower case letters, top/bottom/left/right/black/white symmetries for the game of Go
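
Collapsing equivalent keys can be sketched as follows for the upper/lower case example. The helper `canonical` is a hypothetical name, and the hash function is the simple code point sum used earlier.

```ruby
# Hypothetical helper: map equivalent keys to one canonical form before hashing.
def canonical(key)
  key.downcase
end

def hf(key, m)
  canonical(key).codepoints.sum % m
end

a = hf("Aoyama", 100)
b = hf("AOYAMA", 100)   # same bin as a, because both keys collapse to "aoyama"
```

The comparison of keys inside a bin has to use the same canonical form, otherwise equivalent keys reach the same bin but still count as different.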

Special Hash Functions

Universal hashing
- Include a random number to create a different hash function for each program execution
- Solution for some denial-of-service attacks
  (reference)
Perfect hash function
- Custom-designed hash function without conflicts
- Useful when data is completely predefined
- In the best case, the hash table is completely filled
- Application: Keywords in programming languages
- Example implementation: GNU gperf (in Japanese)
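
As a rough sketch of universal hashing, the following picks a hash function at random from the family h(k) = ((a·k + b) mod p) mod m for integer keys. The method name `make_hash_function`, the prime 2³¹ − 1, and the restriction to integer keys smaller than that prime are assumptions for this example.

```ruby
# Sketch of a universal family for integer keys 0 <= k < prime:
#   h(k) = ((a*k + b) mod prime) mod m, with a and b chosen at random.
def make_hash_function(m, rng = Random.new)
  prime = 2_147_483_647          # the prime 2**31 - 1; keys must be smaller
  a = rng.rand(1...prime)        # 1 <= a < prime
  b = rng.rand(0...prime)        # 0 <= b < prime
  ->(k) { ((a * k + b) % prime) % m }
end

hf = make_hash_function(100)
bin = hf.call(15815000)          # usually differs between program executions
```

Because `a` and `b` are not known in advance, an attacker cannot prepare a set of keys that all fall into the same bin.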

Cryptographic Hash Function

Used for electronic signatures, ...
Differences from general hash functions:
- Practically impossible to find two different inputs that produce the same output
- Output usually longer (e.g. 128/256/384/512/... bits)
- Evaluation may take longer
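
For comparison, a cryptographic hash function can be called from Ruby's standard `digest` library. SHA-256 is chosen here only as one common example; the input strings are made up.

```ruby
require 'digest'

# SHA-256 always produces 256 bits, shown here as 64 hexadecimal digits.
d1 = Digest::SHA256.hexdigest("Hanako Aoyama")
d2 = Digest::SHA256.hexdigest("Hanako Aoyamb")  # one character changed

length_in_bits = d1.length * 4
```

Even a one-character change in the input gives a completely different output.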

Conflicts

A conflict happens when hf(k₁) = hf(k₂) even though k₁ ≠ k₂
This requires special treatment
Main solutions:
- Chaining
- Open addressing

Terms and Variables for Conflict Resolution

Fields in hash table: bins/buckets
Number of bins: m
(equal to the range of values of the hash function after the modulo operation)
Number of data items: n
Load factor (average number of data items per bin): α
(α = n/m)
For a good (close to random) hash function, the variation in the number of data items for each bin is low
(Poisson distribution)

Chaining

Store conflicting data items in a linked list
Each bin in the hash table is the head of a linked list
If the linked list is short, then search/insertion/deletion will be fast
The average length of the linked list is equal to load factor α
The load factor is usually greater than 1 (e.g. 3≦α≦6)
All operations are carried out in two steps:
1. Use hf(k) to determine the bin
2. Use the key to find the data item in the linked list

Implementation of Chaining

Implementation in Ruby: Ahashdictionary.rb
Uses Array in place of linked list
Uses Ruby's hash function
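
The following is a minimal sketch of chaining along the same lines (an Array for each bin, Ruby's own `hash` method). It is not the content of the file mentioned above; the class name `ChainedDictionary` and the default of 8 bins are assumptions.

```ruby
# Hypothetical sketch of a chained hash table (not the lecture's file).
class ChainedDictionary
  def initialize(m = 8)
    @bins = Array.new(m) { [] }   # each bin holds [key, value] pairs
  end

  def []=(key, value)
    bin = bin_for(key)
    pair = bin.find { |k, _| k == key }
    if pair
      pair[1] = value             # replacement
    else
      bin << [key, value]         # insertion
    end
  end

  def [](key)
    pair = bin_for(key).find { |k, _| k == key }
    pair && pair[1]
  end

  def delete(key)
    bin = bin_for(key)
    index = bin.index { |k, _| k == key }
    index && bin.delete_at(index)[1]
  end

  private

  # Step 1: use the hash function to determine the bin.
  # Step 2 (finding the key inside the bin) is done by the callers.
  def bin_for(key)
    @bins[key.hash % @bins.size]
  end
end

students = ChainedDictionary.new
students[15815000] = "I.T. Aoyama"
students[15715000] = "K.S. Aoyama"
```

This sketch never changes the number of bins, so the load factor grows with every insertion.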

Open Addressing

Store key and data in hash table itself
In case of conflict, successively check different bins
For check number i, use hash function ohf(key, i)
- Linear probing: ohf(key, i) = (hf(key) + i) mod m
- Quadratic probing: ohf(key, i) = (hf(key) + c₁ i + c₂ i²) mod m
- Many other variations exist
The load factor has to be between 0 and 1; ≦0.5 is reasonable
Problem: Deletion is difficult
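
A minimal sketch of open addressing with linear probing is given below. It assumes a fixed table size and keys that are never `nil`, and it handles deletion by leaving a marker in the bin, which is one common approach; the class name and the marker `DELETED` are hypothetical.

```ruby
# Hypothetical sketch: open addressing with linear probing, fixed table size.
# Assumption: nil is never used as a key (nil marks an empty bin).
class LinearProbingDictionary
  DELETED = Object.new   # marker left behind by deletions

  def initialize(m = 16)
    @keys = Array.new(m)
    @values = Array.new(m)
  end

  def []=(key, value)
    i = find_slot(key) || free_slot(key)
    raise "table full" if i.nil?
    @keys[i] = key
    @values[i] = value
  end

  def [](key)
    i = find_slot(key)
    i && @values[i]
  end

  def delete(key)
    i = find_slot(key)
    return nil if i.nil?
    @keys[i] = DELETED      # keeps the probe sequences of other keys intact
    value = @values[i]
    @values[i] = nil
    value
  end

  private

  # ohf(key, i) = (hf(key) + i) mod m
  def probe(key, i)
    (key.hash + i) % @keys.size
  end

  # Bin that currently holds key, or nil if the key is not in the table.
  def find_slot(key)
    @keys.size.times do |i|
      j = probe(key, i)
      return nil if @keys[j].nil?      # empty bin: key is not in the table
      return j if @keys[j] == key
    end
    nil
  end

  # First bin on the probe sequence that is empty or marked as deleted.
  def free_slot(key)
    @keys.size.times do |i|
      j = probe(key, i)
      return j if @keys[j].nil? || @keys[j].equal?(DELETED)
    end
    nil
  end
end
```

Setting a deleted bin back to `nil` would cut the probe sequence of any key stored behind it, which is why deletion is difficult.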

Time Complexity of Hashing

(average, for chaining)

Calculation of hash function
- Dependent on key length
- O(1) if key length is constant or limited
Search in bin
- Dependent on load factor
- O(1) if load factor is below a given constant
- O(n) in worst case, but this can be avoided by choice of hash function

Expansion and Shrinking of Hash Table

The efficiency of hashing depends on the load factor
If the number of data items increases, the hash table has to be expanded
If the number of data items decreases, it is desirable to shrink the hash table
Expansion/shrinking can be implemented by re-inserting the data into a new hash table
(changing the divisor of the modulo operation)
Expansion/shrinking is heavy (time: O(n))
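
Expansion by re-insertion can be sketched as follows for a chained table stored as an array of bins. The helper name `rehash` and the sample data are assumptions.

```ruby
# Hypothetical helper: move all [key, value] pairs into a table with new_m bins.
def rehash(bins, new_m)
  new_bins = Array.new(new_m) { [] }
  bins.each do |bin|
    bin.each do |key, value|
      # the divisor of the modulo operation changes from bins.size to new_m
      new_bins[key.hash % new_m] << [key, value]
    end
  end
  new_bins
end

bins = Array.new(2) { [] }
(1..10).each { |k| bins[k.hash % 2] << [k, k * k] }   # load factor 10/2 = 5
bins = rehash(bins, 8)                                 # load factor 10/8 = 1.25
```

Every data item is visited once, which is where the O(n) cost comes from.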

Analysis of the Time Complexity of Expansion

If the hash table is expanded for each data insertion, this is extremely inefficient
Limit the number of expansions:
- Double the size of the hash table whenever the number of data items doubles
- The time needed for all re-insertions while inserting n data items (n=2^x) is
  2 + 4 + 8 + ... + n/2 + n < 2n = O(n)
- The amortized time complexity per data item is O(n)/n = O(1)

(simple example of amortized analysis)
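
The bound can be checked with a small simulation. The sketch assumes that the table starts with one bin and is doubled whenever it is full, so the exact sum differs slightly from the one above, but it stays below 2n in the same way.

```ruby
# Count how many re-insertions the doubling strategy causes for n insertions.
def reinsertions(n)
  capacity = 1
  count = 0
  moved = 0
  n.times do
    if count == capacity   # table full: double it and re-insert everything
      moved += count
      capacity *= 2
    end
    count += 1
  end
  moved
end

total = reinsertions(1024)   # 1 + 2 + 4 + ... + 512, which is below 2 * 1024
```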

Evaluation of Hashing

Advantages:

Search/insertion/deletion are possible in (average) constant time
Reasonably good actual performance
No need for keys to be ordered
Wide field of application

Problems:

Sorting needs to be done separately
Proximity/similarity search is impossible
Expansion/shrinking takes time (normal operation may be interrupted while it runs)

Comparison of Dictionary Implementations

Implementation	Search	Insertion	Deletion	Sorting
Sorted array	O(log n)	O(n)	O(n)	O(n)
Unordered array/linked list	O(n)	O(1)	O(n)	O(n log n)
Balanced tree	O(log n)	O(log n)	O(log n)	O(n)
Hash table	O(1)	O(1)	O(1)	O(n log n)

The Ruby `Hash` Class

(Perl: hash; Java: HashMap; Python: dict)

Because dictionary ADTs are often implemented using hashing,
in many programming languages, dictionaries are called hashes
Creation: my_hash = {} or my_hash = Hash.new
Initialization: months = {'January' => 31, 'February' => 28, 'March' => 31, ... }
Insertion/replacement: months['February'] = 29
Lookup: this_month_length = months[this_month]
Hash in Ruby has more functionality than in other programming languages (presentation)
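
A few of these additional features are shown in the sketch below; the selection is only an illustration, and the sample data is shortened to three months.

```ruby
months = { 'January' => 31, 'February' => 28, 'March' => 31 }
months['February'] = 29                 # replacement

missing = months['April']               # nil: no such key
safe    = months.fetch('April', 0)      # fetch with a default value
names   = months.keys                   # keys come back in insertion order
long    = months.select { |_, days| days == 31 }.keys

counts = Hash.new(0)                    # default value 0 for unknown keys
"hash table".each_char { |c| counts[c] += 1 }
```

`Hash.new(0)` is a convenient way to count things, because unknown keys start at 0 instead of `nil`.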

Implementation of Hashing in Ruby

Source: st.c
Used chaining until very recently
(originally by Peter Moore, University of California Berkeley (1989))
On Nov. 7, 2016 replaced by open addressing (by Vladimir Makarov, with help from Yura Sokolov)
Reason: Faster because open addressing works better with the cache hierarchy
Used inside Ruby, too:
- Lookup of global identifiers such as class names
- Lookup of methods for each class
- Lookup of instance variables for each object

Summary

A hash table implements a dictionary ADT using a hash function
Main points:
Selection of hash function
Conflict resolution methods (chaining or open addressing)
Reasonably good actual performance
Wide field of application

Glossary

direct addressing: 直接アドレス表
hashing, scatter storage technique: ハッシュ法、挽き混ぜ法
hash function: ハッシュ関数
hash table: ハッシュ表
joseki: 定石 (囲碁)
universal hashing: 万能ハッシュ法
denial of service attack: DOS 攻撃、サービス拒否攻撃
perfect hash function: 完全ハッシュ関数
cryptographic hash function: 暗号技術的ハッシュ関数
electronic signatures: 電子署名
conflict: 衝突
Poisson distribution: ポアソン分布
chaining: チェイン法、連鎖法
open addressing: 開番地法、オープン法
load factor: 占有率
linear probing: 線形探査法
quadratic probing: 二次関数探査法
divisor: (割り算の) 法
amortized analysis: 償却分析
proximity search: 近接探索
similarity search: 類似探索