Hash Tables and Hash Functions

(ハッシュ表とハッシュ関数)

Data Structures and Algorithms

10th lecture, December 1, 2022

https://www.sw.it.aoyama.ac.jp/2022/DA/lecture10.html

Martin J. Dürst

Today's Schedule

Leftovers and summary of last lecture
Additional speedup for dictionaries
Overview of hashing
Hash functions
Conflict resolution
Evaluation of hashing
Hashes in Ruby
Summary

Summary of Last Lecture

Balanced trees keep search/insertion/deletion in a dictionary ADT at O(log n) worst-case time
2-3-4 trees, B-trees, and B+ trees allow nodes with more than two children, but keep all leaves at the same depth
Red-black trees and AVL trees limit the variation of the tree height
B+ trees are very useful for file systems and databases on secondary storage

Time Complexity for Known Dictionary Implementations

Implementation	Search	Insertion	Deletion
Sorted array	`O`(log `n`)	`O`(`n`)	`O`(`n`)
Unordered array/linked list	`O`(`n`)	`O`(1)	`O`(`n`)
Balanced tree	`O`(log `n`)	`O`(log `n`)	`O`(log `n`)

Can we do better?

Direct Addressing

Use an array with an element for each key value
Search: value = array[key], time complexity: O(1)
Insertion/replacement: array[key] = value, time complexity: O(1)
Deletion: array[key] = nil, time complexity: O(1)
Example:
students = []
students[15820000] = "I.T. Aoyama"

Problem: Array size, non-numeric keys
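
The array size problem can be seen directly in Ruby. This is a small sketch using the example above; assigning to a single large index makes the array grow to cover every smaller index as well.

```ruby
students = []
students[15820000] = "I.T. Aoyama"

puts students.size                      # prints 15820001: one element per possible key value
puts students.count { |s| !s.nil? }     # prints 1: only one element is actually used
```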

Solution: Transform key with a hash function

Overview of Hashing

(also called scatter storage technique)

Transform the key k to a compact space using the hash function hf
Use hf(k) instead of k for locating the data in the array
The data is contained in the hash table (below: table)
hf(k) can be evaluated in constant time (O(1))
Search: value = table[hf(key)], time complexity: O(1)
Insertion/replacement: table[hf(key)] = value, time complexity: O(1)
Deletion: table[hf(key)] = nil, time complexity: O(1)

Problems with Hashing

Choice/design of hash function
Example 1: remainder:

def hf(k); k % 100; end

students[15820000 % 100] = "I.T.Aoyama"

Example 2: sum of codepoints (character numbers):
def hf(k); k.codepoints.sum; end

students["Hanako Aoyama".codepoints.sum] = ...
Resolution of conflicts
What happens with the following:

students[15821000 % 1000] = "I.T.Aoyama"

students[15721000 % 1000] = "K.S.Aoyama"
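
A minimal sketch of what happens, assuming a plain Ruby array as the table and a remainder hash function with divisor 1000 (the name table is chosen here for illustration):

```ruby
# Remainder hash function with divisor 1000, as in the two lines above.
def hf(k)
  k % 1000
end

table = []
table[hf(15821000)] = "I.T.Aoyama"
table[hf(15721000)] = "K.S.Aoyama"    # same index: overwrites the first entry

puts hf(15821000) == hf(15721000)     # prints true: both keys map to index 0
puts table[hf(15821000)]              # prints K.S.Aoyama: the first student is lost
```

Without conflict resolution, the second insertion silently replaces the first.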

Overview of Hash Function

Goals:
1. From a key, calculate an index that is evenly (randomly) distributed
  (this is the reason for the word hash, as in hashed beef or hash browns)
2. Adjust the range of the result to the size of the hash table
Steps:
1. Calculate a large integer (e.g. int in C) from the key
2. Adjust this large integer to the hash table size (using a modulo operation or Fibonacci Hashing)

Goal/step 2 is easy. Therefore, we concentrate on goal/step 1.
(often step 1 alone is called 'hash function')
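
The two steps can be sketched in Ruby as follows. This is only an illustration: step 1 uses Ruby's built-in String#hash, and the names MASK64, GOLDEN, index_by_modulo, and index_by_fibonacci are made up for this example. Fibonacci hashing is shown in its common 64-bit form, which assumes a table size that is a power of two.

```ruby
MASK64 = (1 << 64) - 1
GOLDEN = 11400714819323198485   # approximately 2**64 divided by the golden ratio

# Step 2, variant A: modulo reduction, works for any table size m.
def index_by_modulo(h, m)
  h % m
end

# Step 2, variant B: Fibonacci hashing for a table of size 2**bits.
# Multiply, keep the low 64 bits, then use the top `bits` bits as index.
def index_by_fibonacci(h, bits)
  ((h * GOLDEN) & MASK64) >> (64 - bits)
end

# Step 1: a large integer from the key, treated as unsigned 64 bits.
h = "Hanako Aoyama".hash & MASK64

puts index_by_modulo(h, 1000)     # some index in 0...1000
puts index_by_fibonacci(h, 10)    # some index in 0...1024
```

The printed indices differ between program runs because Ruby seeds String#hash randomly.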

Hash Function Example 1

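unsigned int sdbm_hash(const char key[])
{
    unsigned int hash = 0;  // unsigned: overflow wraps around instead of being undefined
    while (*key) {
        // parentheses are needed because + binds more tightly than <<
        hash = (unsigned char)*key++ + (hash << 6)
               + (hash << 16) - hash;
    }
    return hash;
}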

Hash Function Example 2

(simplified from MurmurHash3; for 32-bit machines)

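#include <stdint.h>

#define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))
#define C1 0xcc9e2d51u  // multiplication constant c1 of MurmurHash3 (32-bit)
#define R1 15           // rotation amount r1 of MurmurHash3 (32-bit)

uint32_t too_simple_hash(const uint32_t key[], int length)
{
    uint32_t h = 0;  // unsigned 32-bit: shifts and overflow are well-defined
    for (int i=0; i<length; i++) {
        uint32_t k = key[i] * C1;
        h ^= ROTL32(k, R1);
    }
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    return h;
}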

Frequent operations in hash functions: Addition (+), multiplication (*), bitwise XOR (^), shift (<<, >>)

Evaluation of Hash Functions

Quality of distribution
Execution speed
Ease of implementation

Precautions for Hash Functions

Use all parts of the key
Counterexample: Using only characters 3 and 4 of a string → bad distribution
Do not use data besides the key
If other data attributes (e.g. price of a product, student's marks) are used and later change, the hash function result will change,
and the data can no longer be found.
Collapse equivalent keys
Example 1: Strings: Upper/lower case letters
Example 2: Corners in the game of Go:
top/bottom, left/right, diagonal, and black/white symmetries
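
Example 1 can be sketched as follows, assuming that keys differing only in upper/lower case count as equivalent. The function name case_insensitive_hf is hypothetical; it reuses the sum-of-codepoints idea from earlier only for brevity.

```ruby
# Normalize the key first, then hash: equivalent keys get the same hash value.
def case_insensitive_hf(key)
  key.downcase.codepoints.sum
end

puts case_insensitive_hf("Aoyama") == case_insensitive_hf("AOYAMA")   # prints true
puts case_insensitive_hf("Aoyama") == case_insensitive_hf("Aoyamb")   # prints false
```

The key comparison used during lookup has to apply the same normalization, otherwise equivalent keys land in the same bin but are still treated as different.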

Conflicts

A conflict happens when
hf(k₁) = hf(k₂) but k₁ ≠ k₂
Even with good hash functions, conflicts happen quite easily
This requires special treatment
Main solutions:
- Chaining
- Open addressing

Terms and Variables for Conflict Resolution

Number of data items: n
Fields in hash table: bins/buckets
Number of bins: m
(equal to the range of values of the hash function after the modulo operation)
Load factor (average number of data items per bin): α (= n/m)
For a good (close to random) hash function, the variation in the number of data items for each bin is low
(Poisson distribution)
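
These terms can be tried out with a small experiment. The sketch below assumes hypothetical consecutive student numbers as keys and uses Ruby's Integer#hash as the hash function; it counts how many items fall into each bin.

```ruby
n = 3000                       # number of data items
m = 1000                       # number of bins
bins = Array.new(m, 0)         # bins[i] counts the items in bin i

(15820000...15820000 + n).each do |key|
  bins[key.hash % m] += 1
end

alpha = n.to_f / m             # load factor
puts alpha                     # prints 3.0
p bins.tally.sort.to_h         # number of bins holding 0, 1, 2, ... items
```

The exact counts vary between runs, but most bins should hold a number of items close to α.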

Chaining

Store conflicting data items in a linked list
Each bin in the hash table starts a linked list
If the linked list is short, then search/insertion/deletion will be fast
The average length of the linked list is equal to load factor α
The load factor is usually greater than 1 (e.g. 3 ≤ α ≤ 6)
All operations are carried out in three steps:
1. Use hf(k) mod m to find the bin
2. Use hf(k) (without modulo operation) to find a candidate entry in the linked list
3. Use the actual key k to confirm that we found the correct data item in the linked list

Implementation of Chaining

Implementation in Ruby: Ahashdictionary.rb
Uses Array in place of linked list
Uses Ruby's hash function
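
The following is an independent minimal sketch of chaining, not the contents of the file mentioned above. Like that file, it uses an Array per bin in place of a linked list and Ruby's own hash method. The class name ChainedTable and the fixed number of bins are assumptions for illustration.

```ruby
# Each bin is an Array of [hash_value, key, value] entries.
class ChainedTable
  def initialize(m = 8)
    @bins = Array.new(m) { [] }
  end

  def [](key)
    entry = find(key)
    entry && entry[2]
  end

  def []=(key, value)
    entry = find(key)
    if entry
      entry[2] = value                             # replacement
    else
      h = key.hash
      @bins[h % @bins.size] << [h, key, value]     # insertion
    end
  end

  def delete(key)
    h = key.hash
    bin = @bins[h % @bins.size]
    i = bin.index { |eh, ek, _| eh == h && ek.eql?(key) }
    i && bin.delete_at(i)[2]                       # returns the deleted value or nil
  end

  private

  def find(key)
    h = key.hash
    bin = @bins[h % @bins.size]                    # step 1: hf(k) mod m selects the bin
    # step 2: compare the full hash values; step 3: confirm with the actual key
    bin.find { |eh, ek, _| eh == h && ek.eql?(key) }
  end
end

students = ChainedTable.new
students[15821000] = "I.T.Aoyama"
students[15721000] = "K.S.Aoyama"
puts students[15821000]        # prints I.T.Aoyama
students.delete(15821000)
p students[15821000]           # prints nil
```

This sketch never changes the number of bins, so the load factor grows with every insertion.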

Open Addressing

Store key and data in hash table itself
In case of conflict, successively check different bins
For check number i (i = 0, 1, 2,...), use hash function ohf(key, i)
- Linear probing: ohf(key, i) = (hf(key) + i) mod m
- Quadratic probing: ohf(key, i) = (hf(key) + c₁ i + c₂ i²) mod m
- Many other variations exist
The load factor has to be between 0 and 1; ≤ 0.5 is reasonable
Problem: Deletion is difficult
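
A minimal sketch of open addressing with linear probing follows. It handles deletion with a marker object (often called a tombstone), which is one common technique and not necessarily the one used in any particular implementation. The class name LinearProbingTable is hypothetical; the table has a fixed size and nil cannot be used as a key.

```ruby
class LinearProbingTable
  DELETED = Object.new          # tombstone: marks a bin whose entry was deleted

  def initialize(m = 16)
    @keys = Array.new(m)        # nil means "never used"
    @values = Array.new(m)
  end

  def [](key)
    i = locate(key)
    i && @values[i]
  end

  def []=(key, value)
    i = locate(key) || free_bin(key)
    raise "table full" unless i
    @keys[i] = key
    @values[i] = value
  end

  def delete(key)
    i = locate(key)
    return nil unless i
    value = @values[i]
    @keys[i] = DELETED          # not nil: nil would cut off later probe sequences
    @values[i] = nil
    value
  end

  private

  # Visits the bins (hf(key) + i) mod m for i = 0, 1, 2, ..., m-1.
  def probe(key)
    m = @keys.size
    start = key.hash % m
    m.times { |i| yield((start + i) % m) }
    nil
  end

  # Index of the bin holding key, or nil if the key is absent.
  def locate(key)
    probe(key) do |bin|
      return nil if @keys[bin].nil?            # never-used bin: key is absent
      return bin if @keys[bin].eql?(key)
    end
  end

  # Index of the first bin that can take a new entry, or nil if there is none.
  def free_bin(key)
    probe(key) do |bin|
      return bin if @keys[bin].nil? || @keys[bin].equal?(DELETED)
    end
  end
end
```

Setting a deleted bin back to nil would make every key stored further along the same probe sequence unreachable, which is why deletion is difficult.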

Time Complexity of Hashing

(average, for chaining)

Calculation of hash function
- Dependent on key length
- O(1) if key length is constant or limited
Search in bin
- Dependent on load factor
- O(1) if load factor is below a given constant
- O(n) in worst case, but this can be avoided by choice of hash function

Expansion and Shrinking of Hash Table

The efficiency of hashing depends on the load factor
If the number of data items increases, the hash table has to be expanded
If the number of data items decreases, it is desirable to shrink the hash table
Expansion/shrinking can be implemented by re-inserting the data into a new hash table
(changing the divisor of the modulo operation)
Expansion/shrinking is expensive (time: O(n))
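
Re-insertion into a new table can be sketched as follows, assuming chaining with bins that are Arrays of [key, value] pairs. The function name rehash is chosen for illustration.

```ruby
# Re-insert all [key, value] pairs from the old bins into new_m new bins.
# Only the divisor of the modulo operation changes; key.hash stays the same.
def rehash(bins, new_m)
  new_bins = Array.new(new_m) { [] }
  bins.each do |bin|
    bin.each do |key, value|
      new_bins[key.hash % new_m] << [key, value]
    end
  end
  new_bins
end

bins = Array.new(4) { [] }
(1..20).each { |k| bins[k.hash % 4] << [k, k * k] }

bins = rehash(bins, 8)          # expansion: every item is touched once, O(n)
puts bins.size                  # prints 8
puts bins.sum(&:size)           # prints 20: no item is lost
```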

Analysis of the Time Complexity of Expansion

If the hash table is expanded for every data insertion, this is very inefficient
Limit the number of expansions:
- Double the size of the hash table whenever the number of data items doubles
- The time needed for the insertion of n data items (n=2^x) is
  2 + 4 + 8 + ... + n/2 + n < 2n = O(n)
- The time complexity per data item is O(n)/n = O(1)
This is a simple example of amortized analysis.
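
The bound can be checked by counting. The sketch below assumes a table that starts with one bin and doubles whenever it is full; it counts only the re-insertions caused by expansions (the function name copies_for is hypothetical).

```ruby
# Number of item copies caused by doublings while inserting n items one by one.
def copies_for(n)
  size = 1
  items = 0
  copies = 0
  n.times do
    if items == size          # table is full: double it and re-insert everything
      copies += items
      size *= 2
    end
    items += 1
  end
  copies
end

n = 1024
puts copies_for(n)            # prints 1023 (= 1 + 2 + 4 + ... + 512)
puts n + copies_for(n)        # prints 2047: insertions plus copies stay below 2n
```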

Special Purpose Hash Functions

Universal hashing
Perfect hash function
Cryptographic hash function

Universal Hashing

Denial-of-service attack (reference):
- Attacker provides lots of data with same hash value
- Efficiency of hash degrades from O(1) to O(n)
Solution: Use a random number to create a different hash function for each program execution
Example: ruby -e 'puts 123.hash, 123.hash' will produce different results on different invocations, but the same result twice within one invocation
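
One classic way to build such a family for integer keys is h(k) = ((a·k + b) mod p) mod m, with a prime p larger than every key and random a and b chosen once per program execution. The sketch below illustrates that construction; it is not how Ruby itself randomizes its hash method, and the names UniversalHash and P are hypothetical.

```ruby
P = 2_147_483_647               # the prime 2**31 - 1; keys are assumed to be smaller

class UniversalHash
  def initialize(m, rng = Random.new)
    @m = m
    @a = rng.rand(1...P)        # 1 <= a < p
    @b = rng.rand(0...P)        # 0 <= b < p
  end

  def call(k)
    ((@a * k + @b) % P) % @m
  end
end

hf = UniversalHash.new(1000)
puts hf.call(15821000)                          # index in 0...1000, differs between runs
puts hf.call(15821000) == hf.call(15821000)     # prints true: stable within one run
```

Because an attacker cannot know a and b in advance, it is hard to prepare a set of keys that all land in the same bin.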

Perfect Hash Function

Custom-designed hash function without conflicts
Useful when data is completely predefined
In the best case, the hash table is completely filled
Application: Keywords in programming languages
Example implementation: GNU gperf
(used in Ruby character property lookups)

Cryptographic Hash Function

Used for electronic signatures, ...
Differences from general hash functions:
- Output usually longer (e.g. 128/256/384/512/... bits)
- Evaluation may take longer
- One-way function, essentially impossible to invert (find k from hf(k))
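
As an illustration, SHA-256 is available through Ruby's standard digest library. The sketch below only shows the output length and that the result is deterministic; the input strings are arbitrary examples.

```ruby
require 'digest'

digest = Digest::SHA256.hexdigest("Hanako Aoyama")
puts digest.length * 4                                      # prints 256: 64 hex digits of 4 bits each
puts Digest::SHA256.hexdigest("Hanako Aoyama") == digest    # prints true: same input, same output
puts Digest::SHA256.hexdigest("Hanako Aoyamb") == digest    # prints false
```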

Evaluation of Hashing

Advantages:

Search/insertion/deletion are possible in (average) constant time
Reasonably good actual performance
No need for keys to be numeric or ordered
Wide field of application

Problems:

Sorting needs to be done separately
(Ruby Hashes store insertion order, but not key order)
Proximity/similarity search is impossible
Expansion/shrinking requires time (possible operation interrupt)

Comparison of Dictionary Implementations

Implementation	Search	Insertion	Deletion	Sorting
Sorted array	`O`(log `n`)	`O`(`n`)	`O`(`n`)	`O`(`n`)
Unordered array/linked list	`O`(`n`)	`O`(1)	`O`(`n`)	`O`(`n` log `n`)
Balanced tree	`O`(log `n`)	`O`(log `n`)	`O`(log `n`)	`O`(`n`)
Hash table	`O`(1)	`O`(1)	`O`(1)	`O`(`n` log `n`)

The Ruby `Hash` Class

(Perl: hash; Java: HashMap; Python: dict; JavaScript, JSON: object)

Because dictionary ADTs are often implemented using hashing,
in many programming languages, dictionaries are also called "hash"
Creation: my_hash = {} or my_hash = Hash.new
Initialization: months = {'January' => 31, 'February' => 28, 'March' => 31, ... }
Insertion/replacement: months['February'] = 29
Lookup: this_month_length = months[this_month]
Hash in Ruby has more functionality than in other programming languages (presentation)
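
The operations above, together with a few more, in one short runnable example (the extra keys and the counts variable are only for illustration):

```ruby
months = { 'January' => 31, 'February' => 28, 'March' => 31 }
months['February'] = 29                 # replacement
months['April'] = 30                    # insertion
puts months['March']                    # prints 31
p months['May']                         # prints nil: missing keys give nil by default
puts months.key?('April')               # prints true
months.delete('January')                # deletion
p months.keys                           # prints ["February", "March", "April"]

counts = Hash.new(0)                    # default value 0 for missing keys
"hash".each_char { |c| counts[c] += 1 }
puts counts['h']                        # prints 2
```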

Implementation of Hashing in Ruby

Source: st.c
Used chaining until 2016
(originally by Peter Moore, University of California Berkeley (1989))
In Nov. 2016 replaced by open addressing (by Vladimir Makarov, with help from Yura Sokolov)
Reason: Faster, because open addressing works better with modern cache hierarchies
Used inside Ruby, too:
- Lookup of global identifiers such as class names
- Lookup of methods for each class
- Lookup of instance variables for each object

Summary

A hash table implements a dictionary ADT using a hash function
Main design points:
- Selection of hash function
- Conflict resolution methods (chaining or open addressing)
Reasonably good actual performance
Wide field of application

Glossary

direct addressing: 直接アドレス表
hashing, scatter storage technique: ハッシュ法、挽き混ぜ法
hash function: ハッシュ関数
hash table: ハッシュ表
game of Go: 囲碁
joseki: 定石 (囲碁)
conflict: 衝突
Poisson distribution: ポアソン分布
chaining: チェイン法、連鎖法
open addressing: 開番地法、オープン法
load factor: 占有率
linear probing: 線形探査法
quadratic probing: 二次関数探査法
divisor: (割り算の) 法
amortized analysis: 償却分析
universal hashing: 万能ハッシュ法
perfect hash function: 完全ハッシュ関数
denial of service attack: DOS 攻撃、サービス拒否攻撃
cryptographic hash function: 暗号技術的ハッシュ関数
one-way function: 一方向性関数
electronic signature: 電子署名
proximity search: 近接探索
similarity search: 類似探索