# Hash Functions and Hash Tables

(ハッシュ関数とハッシュ表)

## Data Structures and Algorithms

### 10th lecture, November 19, 2015

http://www.sw.it.aoyama.ac.jp/2015/DA/lecture10.html

### Martin J. Dürst

© 2009-15 Martin J. Dürst, Aoyama Gakuin University

# Today's Schedule

• Additional speedup for dictionary
• Overview of hashing
• Hash functions
• Conflict resolution
• Evaluation of hashing
• Hash in Ruby
• Summary

# Time Complexity for Dictionary Implementations up to Here

| Implementation | Search | Insertion | Deletion |
|---|---|---|---|
| Sorted array | O(log n) | O(n) | O(n) |
| Unordered array/linked list | O(n) | O(1) | O(n) |
| Balanced tree | O(log n) | O(log n) | O(log n) |

# Direct Addressing

• Use an array with an element for each key value
• Search: `value = array[key]`, time: O(1)
• Insertion/replacement: `array[key] = value`, time: O(1)
• Deletion: `array[key] = nil`, time: O(1)
• Example:
```ruby
students = []
students[key] = "Hanako Aoyama"  # key: a numeric key value
```

Problem: Array size, non-numeric keys

Solution: key transformation
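The three O(1) operations above can be sketched in Ruby; the numeric key 15 is a made-up example value, since direct addressing uses the key itself as the array index:

```ruby
# Direct addressing: the key itself is the array index.
# 15 is a hypothetical numeric key (e.g. a short ID number).
students = []
students[15] = "Hanako Aoyama"   # insertion/replacement, O(1)
value = students[15]             # search, O(1)
students[15] = nil               # deletion, O(1)
```

Note that Ruby silently grows the array to index 15; for large or non-numeric keys this is exactly the array-size problem mentioned above.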

# Overview of Hashing

(also called scatter storage technique)

• Transform the key k to a compact space using the hash function hf
• Use hf(k) instead of k in the same way as direct addressing
• The array is called hash table
• The hash function is evaluated in O(1) time
• Search: `value = table[hf(key)]`, time: O(1)
• Insertion/replacement: `table[hf(key)] = value`, time: O(1)
• Deletion: `table[hf(key)] = nil`, time: O(1)

Problem 1: Design of hash function

Problem 2: Resolution of conflicts
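The scheme above can be sketched in Ruby. The hash function `hf` below is a deliberately simplistic stand-in (byte sum modulo the table size), chosen only to keep the example short:

```ruby
M = 11                      # hash table size (example value)

# Deliberately simplistic hash function: byte sum, folded into 0...M.
def hf(key)
  key.bytes.sum % M
end

table = Array.new(M)
table[hf("April")] = 30     # insertion/replacement
length = table[hf("April")] # search
table[hf("April")] = nil    # deletion
```

A real hash function must distribute keys much more evenly (problem 1), and two keys may still land in the same bin (problem 2, conflicts), as the following slides discuss.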

# Overview of Hash Function

• Goals:
1. From a key, calculate an index that is as smoothly distributed as possible
2. Adjust the range of the result to the hash table size
• Steps:
1. Calculate a large integer (e.g. `int` in C) from the key
2. Adjust this large integer to the hash table size using a modulo operation

Step 2 is easy. Therefore, we concentrate on step 1.
(often step 1 alone is called 'hash function')
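The two steps can be sketched in Ruby; the multiplier 31 in step 1 is an arbitrary illustrative choice, not a recommendation:

```ruby
# Step 1: calculate a large integer from the key.
def big_integer_hash(key)
  key.each_byte.reduce(0) { |h, b| h * 31 + b }  # 31: arbitrary constant
end

# Step 2: adjust the large integer to the table size with modulo.
def table_index(key, table_size)
  big_integer_hash(key) % table_size
end

i = table_index("example", 101)   # always in 0...101
```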

# Hash Function Example 1

```c
unsigned int sdbm_hash(char key[])
{
    unsigned int hash = 0;
    while (*key) {
        hash = *key++ + (hash << 6) + (hash << 16) - hash;
    }
    return hash;
}
```
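The same algorithm can be transcribed into Ruby, the course's implementation language. Since Ruby integers have arbitrary precision, we mask to 32 bits to mimic the overflow behavior of C's `unsigned int`:

```ruby
# Ruby transcription of the sdbm hash shown above.
def sdbm_hash(key)
  hash = 0
  key.each_byte do |c|
    hash = (c + (hash << 6) + (hash << 16) - hash) & 0xFFFFFFFF
  end
  hash
end
```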

# Hash Function Example 2

(simplified from MurmurHash3; for 32-bit machines)

```c
#define ROTL32(n, by) (((n) << (by)) | ((n) >> (32 - (by))))

unsigned int too_simple_hash(unsigned int key[], int length)
{
    unsigned int h = 0;
    for (int i = 0; i < length; i++) {
        unsigned int k = key[i] * C1;  /* C1 is a constant */
        h ^= ROTL32(k, R1);            /* R1 is a constant */
    }
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    return h;
}
```

# Evaluation of Hash Functions

• Quality of distribution
• Execution speed
• Ease of implementation

# Precautions for Hash Functions

• Use all parts of the key
Counterexample: Using only characters 3 and 4 of a string → bad distribution
• Do not use data besides the key
If a non-key attribute (e.g. the price of a product, a student's total marks) is included in the hash and later changes, the hash value changes and the data can no longer be found
• Collapse equivalent keys
Examples: Upper/lower case letters, top/bottom/left/right/black/white symmetries for Joseki in Go
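The last precaution can be sketched in Ruby: hash a canonical form of the key, so that keys the application treats as equal always get the same hash value. Treating upper and lower case as equivalent is an example policy, as in the slide:

```ruby
# Collapse equivalent keys by hashing a canonical (here: downcased)
# form of the key. The multiplier 131 is an arbitrary constant.
def canonical_hash(key, table_size)
  key.downcase.each_byte.reduce(0) { |h, b| h * 131 + b } % table_size
end
```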

# Special Hash Functions

• Universal hashing
• Use a random function to create a different hash function for each program execution
• Implementation: Change a constant in the hash function
• Solution for some denial-of-service attacks
(reference)
• Perfect hash function
• Custom-designed hash function without conflicts
• Useful when data is completely predefined
• In the best case, the hash table is completely filled
• Application: Keywords in programming languages
• Example implementation: GNU gperf
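Universal hashing can be sketched in Ruby by drawing a random constant once per program execution; the multiplicative scheme below is an illustrative choice, not a specific published construction:

```ruby
# Universal hashing sketch: a random multiplier chosen anew on every
# program run, so an attacker cannot predict which keys collide.
SEED = rand(1..2**31) | 1   # random odd constant for this execution

def universal_hash(key, table_size)
  h = 0
  key.each_byte { |b| h = (h * SEED + b) & 0xFFFFFFFF }
  h % table_size
end
```

Within one run the function is consistent (the same key always hashes to the same bin), but the collision pattern differs from run to run.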

# Cryptographic Hash Function

• Used in electronic signatures
• Differences from general hash functions:
• Practically impossible to generate same output from different input
• Output usually longer (e.g. 128/256/384/512/... bits)
• Evaluation may take longer
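Ruby's standard `digest` library provides cryptographic hash functions, for example SHA-256:

```ruby
require 'digest'

# SHA-256 produces a 256-bit output, printed as 64 hex characters.
digest = Digest::SHA256.hexdigest("abc")
```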

# Conflict

• A conflict happens when hf(k1) = hf(k2) even though k1 ≠ k2
• This requires special treatment
• Main solutions:
• Chaining
• Open addressing

# Terms and Variables for Conflict Resolution

• Fields in hash table: bins/buckets
• Number of bins: m
(equal to the range of values of the hash function after the modulo operation)
• Number of data items: n
• Load factor (average number of data items per bin): α
(α = n/m)
• For a good (close to random) hash function, the variation in the number of data items for each bin is low
(Poisson distribution)
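A close-to-random hash function can be simulated in Ruby by throwing n items into m bins at random and inspecting the result (the values of m and n below are arbitrary examples):

```ruby
# Simulate a close-to-random hash function: n items over m bins.
m = 1000
n = 5000
bins = Array.new(m, 0)
n.times { bins[rand(m)] += 1 }

alpha = n.to_f / m    # load factor α = n/m = 5.0 here
max_load = bins.max   # Poisson-like: the busiest bin stays small
```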

# Chaining

• Store conflicting data items in a linked list
• Each bin in the hash table is the head of a linked list
• If the linked list is short, then search/insertion/deletion will be fast
• The average length of the linked list is α
• All operations are carried out in two steps:
1. Use hf(k) to determine the bin
2. Use the key to find the data item in the linked list

# Implementation of Chaining

• Implementation in Ruby: Ahashdictionary.rb
• Uses `Array` in place of linked list
• Uses Ruby's hash function
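As a minimal sketch in the same spirit (this is not the course's own implementation file), each bin holds a Ruby `Array` of [key, value] pairs, and Ruby's built-in `Object#hash` serves as the hash function:

```ruby
# Minimal chaining dictionary: an Array ("chain") per bin.
class ChainingDictionary
  def initialize(bins = 11)
    @table = Array.new(bins) { [] }
  end

  def [](key)                              # search
    pair = bin(key).find { |k, _| k == key }
    pair && pair[1]
  end

  def []=(key, value)                      # insertion/replacement
    pair = bin(key).find { |k, _| k == key }
    if pair
      pair[1] = value
    else
      bin(key) << [key, value]
    end
  end

  def delete(key)                          # deletion
    bin(key).reject! { |k, _| k == key }
  end

  private

  def bin(key)                             # step 1: determine the bin
    @table[key.hash % @table.size]
  end
end
```

Every operation first picks the bin via the hash function, then scans the (on average α-element) chain for the key, exactly the two steps of the previous slide.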

# Open Addressing

• Store key and data in hash table itself
• In case of conflict, successively check different bin
• For check number i, use hash function ohf(key, i)
• Linear probing: ohf(key, i) = hf(key) + i
• Quadratic probing: ohf(key, i) = hf(key) + c1·i + c2·i²
• Many other variations exist
• The load factor has to be between 0 and 1; ≦0.5 is reasonable
• Problem: Deletion is difficult
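Linear probing can be sketched in Ruby with two parallel arrays for keys and values (the helper names below are illustrative, not from the lecture):

```ruby
# Linear probing sketch: key and value live in the table itself;
# on a conflict, step to the next bin (ohf(key, i) = hf(key) + i).
def probe_insert(keys, values, key, value)
  i = key.hash % keys.size
  i = (i + 1) % keys.size while keys[i] && keys[i] != key
  keys[i] = key
  values[i] = value
end

def probe_search(keys, values, key)
  i = key.hash % keys.size
  while keys[i]
    return values[i] if keys[i] == key
    i = (i + 1) % keys.size
  end
  nil                               # empty bin reached: not present
end

keys, values = Array.new(8), Array.new(8)
probe_insert(keys, values, "April", 30)
probe_insert(keys, values, "May", 31)
month_length = probe_search(keys, values, "May")
```

Keeping the load factor at or below 0.5 keeps probe sequences short; simply setting a bin to `nil` on deletion would break later searches that probed past it, which is why deletion is difficult here.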

# Time Complexity of Hashing

(average, for chaining)

• Calculation of hash function
• Dependent on key length
• O(1) if key length is constant or limited
• Search in bin
• Dependent on load factor
• O(1) if load factor is below a given constant
• O(n) in worst case, but this can be avoided by choice of hash function

# Expansion and Shrinking of Hash Table

• The efficiency of hashing depends on the load factor
• If the number of data items increases, the hash table has to be expanded
• If the number of data items decreases, it is desirable to shrink the hash table
• Expansion/shrinking can be implemented by re-inserting the data into a new hash table
(changing the divisor of the modulo operation)
• Expansion/shrinking is heavy (time: O(n))

# Analysis of the Time Complexity of Expansion

• If the hash table is expanded for each data insertion, this is extremely inefficient
• Limit the number of expansions:
• Increase the size of the hash table whenever the number of data items doubles
• The time needed for the insertion of n data items (n = 2^x) is
2 + 4 + 8 + ... + n/2 + n < 2n = O(n)
• The time complexity per data item is O(n)/n = O(1)

(simple example of amortized analysis)
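The geometric sum from the slide can be checked directly in Ruby:

```ruby
# Total re-insertion work when the table doubles each time the item
# count doubles: 2 + 4 + ... + n/2 + n, with n = 2**x.
def total_expansion_cost(x)
  (1..x).sum { |i| 2**i }
end

n = 2**10
cost = total_expansion_cost(10)   # 2 + 4 + ... + 1024 = 2046 < 2n
```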

# Evaluation of Hashing

Advantages:

• Search/insertion/deletion are possible in (average) constant time
• Reasonably good actual performance
• No need for keys to be ordered
• Wide field of application

Problems:

• Sorting needs to be done separately
• Proximity/similarity search is impossible
• Expansion/shrinking requires time (possible operation interrupt)

# Comparison of Dictionary Implementations

| Implementation | Search | Insertion | Deletion | Sorting |
|---|---|---|---|---|
| Sorted array | O(log n) | O(n) | O(n) | O(n) |
| Unordered array/linked list | O(n) | O(1) | O(1) | O(n log n) |
| Balanced tree | O(log n) | O(log n) | O(log n) | O(n) |
| Hash table | O(1) | O(1) | O(1) | O(n log n) |

# The Ruby `Hash` Class

(Perl: `hash`; Java: `HashMap`; Python: `dict`)

• Because dictionary ADTs are often implemented using hashing,
in many programming languages dictionaries are called hashes
• Creation: `my_hash = {}` or `my_hash = Hash.new`
• Initialization: ```months = {'January' => 31, 'February' => 28, 'March' => 31, ... }```
• Insertion/replacement: `months['February'] = 29`
• Lookup: `this_month_length = months[this_month]`
• `Hash` in Ruby has more functionality than in other programming languages (presentation)
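The operations above, plus one piece of the extra functionality (a default value supplied to `Hash.new`), in a short example:

```ruby
months = { 'January' => 31, 'February' => 28, 'March' => 31 }
months['February'] = 29              # insertion/replacement
length = months['February']          # lookup

# Hash.new(0) returns 0 for missing keys: convenient for counting.
counts = Hash.new(0)
"abracadabra".each_char { |c| counts[c] += 1 }
```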

# Implementation of Hashing in Ruby

• Originally by Peter Moore, University of California Berkeley (1989)
• Source: st.c
• Uses chaining
• Expansion by a factor of about 2 whenever a load factor of 5.0 is reached (`ST_DEFAULT_MAX_DENSITY`)
• The size of the hash table is a power of 2 (`next_pow2`)
(earlier, it was a prime number close to a power of 2)
• Even if there are many deletions, the hash table is never shrunk
• Used inside Ruby, too:
• Lookup of global identifiers such as class names
• Lookup of methods for each class
• Lookup of instance variables for each object

# Summary

• A hash table implements a dictionary ADT using a hash function
• Main points: Selection of hash function, conflict resolution method
• Reasonably good actual performance
• Wide field of application

# Glossary

direct addressing

hashing, scatter storage technique
ハッシュ法、挽き混ぜ法
hash function
ハッシュ関数
hash table
ハッシュ表
joseki

universal hashing

denial of service attack
DOS 攻撃、サービス拒否攻撃
perfect hash function

cryptographic hash function

electronic signatures

conflict

Poisson distribution
ポアソン分布
chaining
チェイン法、連鎖法
open addressing

load factor

linear probing

quadratic probing

divisor
(割り算の) 法
amortized analysis

proximity search

similarity search