# Hash Functions and Hash Tables

(ハッシュ関数とハッシュ表)

## Data Structures and Algorithms

### 10th lecture, December 1, 2016

http://www.sw.it.aoyama.ac.jp/2016/DA/lecture10.html

### Martin J. Dürst

© 2009-16 Martin J. Dürst 青山学院大学

# Today's Schedule

• Summary of last lecture
• Additional speedup for dictionary
• Overview of hashing
• Hash functions
• Conflict resolution
• Evaluation of hashing
• Hashes in Ruby
• Summary

# Summary of Last Lecture

• Balanced trees keep search/insertion/deletion in a dictionary ADT at O(log n) worst-case time
• 2-3-4 trees and B(+)-trees increase the degree of the nodes, but keep all leaves at the same depth
• Red-black trees and AVL trees impose limitations on the variation of the tree height
• B-trees and B+ trees are very useful for file systems and databases on secondary storage

# Time Complexity for Known Dictionary Implementations

| Implementation | Search | Insertion | Deletion |
| --- | --- | --- | --- |
| Sorted array | O(log n) | O(n) | O(n) |
| Unordered array/linked list | O(n) | O(1) | O(n) |
| Balanced tree | O(log n) | O(log n) | O(log n) |

# Direct Addressing

• Use an array with an element for each key value
• Search: `value = array[key]`, time: O(1)
• Insertion/replacement: `array[key] = value`, time: O(1)
• Deletion: `array[key] = nil`, time: O(1)
• Example:
```
students = []
students[15815000] = "I.T. Aoyama"
```

Problem: Array size, non-numeric keys

Solution: Transform key with hash function

# Overview of Hashing

(also called scatter storage technique)

• Transform the key k to a compact space using the hash function hf
• Use hf(k) instead of k in the same way as direct addressing
• The array is called hash table
• The hash function is evaluated in O(1) time
• Search: `value = table[hf(key)]`, time: O(1)
• Insertion/replacement: `table[hf(key)] = value`, time: O(1)
• Deletion: `table[hf(key)] = nil`, time: O(1)
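The three operations above can be sketched in Ruby with a toy hash function (the modulo-100 function and the table size are assumptions for illustration, not the lecture's code):

```ruby
TABLE_SIZE = 100                      # illustrative table size

def hf(key)                           # toy hash function: compact the key space
  key % TABLE_SIZE
end

table = Array.new(TABLE_SIZE)

table[hf(15815000)] = "I.T. Aoyama"   # insertion/replacement: O(1)
value = table[hf(15815000)]           # search: O(1)
table[hf(15815000)] = nil             # deletion: O(1)
```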

# Problems with Hashing

1. Choice/design of hash function

Example 1: `def hf(k); k % 100; end`

`students[15815000 % 100] = "I.T. Aoyama"`

Example 2: `def hf(k); k.codepoints.sum; end`

`students["Hanako Aoyama".codepoints.sum] = ...`

2. Resolution of conflicts

What happens with the following:

`students[15815000 % 100] = "I.T. Aoyama"`

`students[15715000 % 100] = "K.S. Aoyama"`

# Overview of Hash Function

• Goals:
1. From a key, calculate an index that is as smoothly distributed as possible
(this is the reason for the word hash, as in hashed beef or hash browns)
2. Adjust the range of the result to the hash table size
• Steps:
1. Calculate a large integer (e.g. `int` in C) from the key
2. Adjust this large integer to the hash table size using a modulo operation

Step 2 is easy. Therefore, we concentrate on step 1.
(often step 1 alone is called 'hash function')
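The two steps can be sketched for string keys as follows; the multiply-by-31 accumulation in step 1 is an assumption for illustration, not the lecture's function:

```ruby
def big_integer_from(key)            # step 1: large integer from the key
  key.codepoints.reduce(0) { |h, c| h * 31 + c }
end

def hf(key, table_size)              # step 2: adjust to the table size
  big_integer_from(key) % table_size
end
```

For example, `hf("Hanako Aoyama", 100)` yields an index in the range 0...100.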

# Hash Function Example 1

```
int sdbm_hash(char key[])
{
    int hash = 0;
    while (*key) {
        /* parentheses needed: + binds tighter than << in C */
        hash = *key++ + (hash << 6) + (hash << 16) - hash;
    }
    return hash;
}
```

# Hash Function Example 2

(simplified from MurmurHash3; for 32 bit machines)

```
#define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))
int too_simple_hash(int key[], int length)
{
    int h = 0;
    for (int i=0; i<length; i++) {
        int k = key[i] * C1;  // C1 is a constant
        h ^= ROTL32(k, R1);   // R1 is a constant
    }
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    return h;
}
```

# Evaluation of Hash Functions

• Quality of distribution
• Execution speed
• Ease of implementation

# Precautions for Hash Functions

• Use all parts of the key
Counterexample: Using only characters 3 and 4 of a string → bad distribution
• Do not use data besides the key
If attributes that may change (e.g. the price of a product, a student's total marks) are included in the hash, the hash value changes when they change, and the data can no longer be found
• Collapse equivalent keys
Examples: Upper/lower case letters, top/bottom/left/right/black/white symmetries for the game of Go
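For the case-folding example above, collapsing equivalent keys can be sketched by normalizing the key before hashing (the normalization rule and function name are assumptions for illustration):

```ruby
def normalized_hash(key, table_size)
  # upper/lower case variants collapse to one key, so they land in the same bin
  key.downcase.hash % table_size
end
```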

# Special Hash Functions

• Universal hashing
• Include a random number to create a different hash function for each program execution
• Solution for some denial-of-service attacks
• Perfect hash function
• Custom-designed hash function without conflicts
• Useful when data is completely predefined
• In the best case, the hash table is completely filled
• Application: Keywords in programming languages
• Example implementation: GNU gperf
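Universal hashing can be sketched as follows; the seed choice and the constants are illustrative, not from a real implementation:

```ruby
SEED = Random.new.rand(1 << 31)      # chosen once per program execution

def universal_hash(key, table_size)
  h = SEED                           # the random seed makes each run's hash differ
  key.each_byte { |b| h = (h * 31 + b) & 0xffffffff }
  h % table_size
end
```

Within one program execution the hash values stay consistent; an attacker cannot precompute colliding keys across runs.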

# Cryptographic Hash Function

• Used for electronic signatures, ...
• Differences from general hash functions:
• Practically impossible to generate same output from different input
• Output usually longer (e.g. 128/256/384/512/... bits)
• Evaluation may take longer

# Conflicts

• A conflict happens when hf(k1) = hf(k2) even though k1 ≠ k2
• This requires special treatment
• Main solutions:
• Chaining
• Open addressing

# Terms and Variables for Conflict Resolution

• Fields in hash table: bins/buckets
• Number of bins: m
(equal to the range of values of the hash function after the modulo operation)
• Number of data items: n
• Load factor (average number of data items per bin): α
(α = n/m)
• For a good (close to random) hash function, the variation in the number of data items for each bin is low
(Poisson distribution)

# Chaining

• Store conflicting data items in a linked list
• Each bin in the hash table is the head of a linked list
• If the linked list is short, then search/insertion/deletion will be fast
• The average length of the linked list is equal to load factor α
• The load factor is usually greater than 1 (e.g. 3≦α≦6)
• All operations are carried out in two steps:
1. Use hf(k) to determine the bin
2. Use the key to find the data item in the linked list

# Implementation of Chaining

• Implementation in Ruby: Ahashdictionary.rb
• Uses `Array` in place of linked list
• Uses Ruby's hash function
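Ahashdictionary.rb itself is not reproduced here; a minimal sketch in the same spirit (bins are `Array`s, Ruby's built-in `hash` supplies the hash function; all names are illustrative) could look like this:

```ruby
class ChainedHash
  def initialize(bins = 8)
    @table = Array.new(bins) { [] }     # each bin is the head of a chain
  end

  def bin_for(key)
    @table[key.hash % @table.size]      # step 1: use the hash to find the bin
  end

  def []=(key, value)
    bin = bin_for(key)
    pair = bin.find { |k, _| k == key } # step 2: search the chain by key
    pair ? pair[1] = value : bin << [key, value]
  end

  def [](key)
    pair = bin_for(key).find { |k, _| k == key }
    pair && pair[1]
  end

  def delete(key)
    bin_for(key).reject! { |k, _| k == key }
  end
end
```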

# Open Addressing

• Store key and data in hash table itself
• In case of conflict, successively check different bin
• For check number i, use hash function ohf(key, i)
• Linear probing: ohf(key, i) = hf(key) + i
• Quadratic probing: ohf(key, i) = hf(key) + c1·i + c2·i²
• Many other variations exist
• The load factor has to be between 0 and 1; α ≦ 0.5 is reasonable
• Problem: Deletion is difficult
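Linear probing can be sketched as below (names are illustrative; deletion is omitted because, as noted above, it needs special care such as tombstone markers):

```ruby
class LinearProbingHash
  def initialize(bins = 16)
    @keys   = Array.new(bins)   # keys and values live in the table itself
    @values = Array.new(bins)
  end

  # assumes the table never fills up (load factor stays below 1)
  def []=(key, value)
    i = key.hash % @keys.size
    i = (i + 1) % @keys.size until @keys[i].nil? || @keys[i] == key
    @keys[i] = key
    @values[i] = value
  end

  def [](key)
    i = key.hash % @keys.size
    until @keys[i].nil?                  # an empty bin ends the probe sequence
      return @values[i] if @keys[i] == key
      i = (i + 1) % @keys.size
    end
    nil
  end
end
```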

# Time Complexity of Hashing

(average, for chaining)

• Calculation of hash function
• Dependent on key length
• O(1) if key length is constant or limited
• Search in bin
• Dependent on load factor
• O(1) if load factor is below a given constant
• O(n) in worst case, but this can be avoided by choice of hash function

# Expansion and Shrinking of Hash Table

• The efficiency of hashing depends on the load factor
• If the number of data items increases, the hash table has to be expanded
• If the number of data items decreases, it is desirable to shrink the hash table
• Expansion/shrinking can be implemented by re-inserting the data into a new hash table
(changing the divisor of the modulo operation)
• Expansion/shrinking is heavy (time: O(n))

# Analysis of the Time Complexity of Expansion

• If the hash table is expanded for each data insertion, this is extremely inefficient
• Limit the number of expansions:
• Increase the size of the hash table whenever the number of data items doubles
• The time needed for the insertion of n data items (n=2x) is
2 + 4 + 8 + ... + n/2 + n < 2n = O(n)
• The time complexity per data item is O(n)/n = O(1)

(simple example of amortized analysis)
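The sum above can be checked numerically; this toy calculation (function name is illustrative) counts only the re-insertion costs at each doubling:

```ruby
def expansion_cost(n)           # n is assumed to be a power of two
  cost = 0
  size = 2
  while size <= n
    cost += size                # re-inserting `size` items at this doubling
    size *= 2
  end
  cost                          # = 2 + 4 + ... + n = 2n - 2 < 2n
end
```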

# Evaluation of Hashing

Advantages:

• Search/insertion/deletion are possible in (average) constant time
• Reasonably good actual performance
• No need for keys to be ordered
• Wide field of application

Problems:

• Sorting needs to be done separately
• Proximity/similarity search is impossible
• Expansion/shrinking requires time (operations may be interrupted)

# Comparison of Dictionary Implementations

| Implementation | Search | Insertion | Deletion | Sorting |
| --- | --- | --- | --- | --- |
| Sorted array | O(log n) | O(n) | O(n) | O(n) |
| Unordered array/linked list | O(n) | O(1) | O(1) | O(n log n) |
| Balanced tree | O(log n) | O(log n) | O(log n) | O(n) |
| Hash table | O(1) | O(1) | O(1) | O(n log n) |

# The Ruby `Hash` Class

(Perl: `hash`; Java: `HashMap`; Python: `dict`)

• Because dictionary ADTs are often implemented using hashing,
in many programming languages, dictionaries are called hashes
• Creation: `my_hash = {}` or `my_hash = Hash.new`
• Initialization: `months = {'January' => 31, 'February' => 28, 'March' => 31, ... }`
• Insertion/replacement: `months['February'] = 29`
• Lookup: `this_month_length = months[this_month]`
• `Hash` in Ruby has more functionality than in other programming languages (presentation)
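The operations above, collected into one runnable snippet:

```ruby
months = { 'January' => 31, 'February' => 28, 'March' => 31 }
months['February'] = 29                  # insertion/replacement
this_month = 'February'
this_month_length = months[this_month]   # lookup
```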

# Implementation of Hashing in Ruby

• Source: st.c
• Used chaining until very recently
(originally by Peter Moore, University of California Berkeley (1989))
• On Nov. 7, 2016 replaced by open addressing (by Vladimir Makarov, with help from Yura Sokolov)
• Reason: Faster because open addressing works better with the cache hierarchy
• Used inside Ruby, too:
• Lookup of global identifiers such as class names
• Lookup of methods for each class
• Lookup of instance variables for each object

# Summary

• A hash table implements a dictionary ADT using a hash function
• Main points:
Selection of hash function
Conflict resolution methods (chaining or open addressing)
• Reasonably good actual performance
• Wide field of application

# Glossary

direct addressing

hashing, scatter storage technique
ハッシュ法、挽き混ぜ法
hash function
ハッシュ関数
hash table
ハッシュ表
joseki

universal hashing

denial of service attack
DOS 攻撃、サービス拒否攻撃
perfect hash function

cryptographic hash function

electronic signatures

conflict

Poisson distribution
ポアソン分布
chaining
チェイン法、連鎖法
open addressing

load factor

linear probing

quadratic probing

divisor
(割り算の) 法
amortized analysis

proximity search

similarity search