# Hash Functions and Hash Tables

(ハッシュ関数とハッシュ表)

## Data Structures and Algorithms

### 10th lecture, December 5, 2019

http://www.sw.it.aoyama.ac.jp/2019/DA/lecture10.html

### Martin J. Dürst

© 2009-19 Martin J. Dürst, Aoyama Gakuin University

# Today's Schedule

• Leftovers and summary of last lecture
• Overview of hashing
• Hash functions
• Conflict resolution
• Evaluation of hashing
• Hashes in Ruby
• Summary

# Summary of Last Lecture

• Balanced trees keep search/insertion/deletion in a dictionary ADT at O(log n) worst-case time
• 2-3-4 trees and B(+) trees increase the degree of the nodes beyond that of a binary tree, but keep all leaves at the same depth
• Red-black trees and AVL trees impose limitations on the variation of the tree height
• B-trees and B+ trees are very useful for file systems and databases on secondary storage

# Time Complexity for Known Dictionary Implementations

| Implementation | Search | Insertion | Deletion |
|---|---|---|---|
| Sorted array | O(log n) | O(n) | O(n) |
| Unordered array/linked list | O(n) | O(1) | O(n) |
| Balanced tree | O(log n) | O(log n) | O(log n) |

# Direct Addressing

• Use an array with an element for each key value
• Search: `value = array[key]`, time: O(1)
• Insertion/replacement: `array[key] = value`, time: O(1)
• Deletion: `array[key] = nil`, time: O(1)
• Example:
```ruby
students = []
students[15818000] = "I.T. Aoyama"
```

Problem: Array size, non-numeric keys

Solution: Transform key with a hash function

# Overview of Hashing

(also called scatter storage technique)

• Transform the key k to a compact space using the hash function hf
• Use hf(k) instead of k in the same way as direct addressing
• The data is contained in the hash table (`table` below)
• The hash function can be evaluated in constant time (O(1))
• Search: `value = table[hf(key)]`, time: O(1)
• Insertion/replacement: `table[hf(key)] = value`, time: O(1)
• Deletion: `table[hf(key)] = nil`, time: O(1)
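
The three operations above can be sketched in Ruby as follows. This is only a minimal sketch that ignores conflicts entirely; the table size of 100 and the remainder-based `hf` are assumptions chosen for illustration.

```ruby
TABLE_SIZE = 100                 # assumed table size, for illustration only
table = Array.new(TABLE_SIZE)    # every bin starts out as nil

# hf: hypothetical hash function mapping an integer key to 0...TABLE_SIZE
def hf(key)
  key % TABLE_SIZE
end

table[hf(15818000)] = "I.T. Aoyama"   # insertion/replacement
value = table[hf(15818000)]           # search
puts value
table[hf(15818000)] = nil             # deletion
```

Each operation evaluates `hf` once and accesses one array element, which is why all three take constant time.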

# Problems with Hashing

1. Choice/design of hash function

Example 1:
remainder: `def hf(k); k % 100; end`

`students[15818000 % 100] = "I.T. Aoyama"`

Example 2:
sum of codepoints (character numbers): `def hf(k); k.codepoints.sum; end`

`students["Hanako Aoyama".codepoints.sum] = ...`

2. Resolution of conflicts

What happens with the following:

`students[15818000 % 100] = "I.T. Aoyama"`

`students[15718000 % 100] = "K.S. Aoyama"`
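
Running the two assignments shows the problem directly: both keys have the remainder 0, so the second entry silently replaces the first. The anagram pair at the end is an added illustration (not from the slides) of the same weakness in the codepoint-sum function.

```ruby
students = []
students[15818000 % 100] = "I.T. Aoyama"
students[15718000 % 100] = "K.S. Aoyama"

puts 15818000 % 100            # bin of the first key
puts 15718000 % 100            # bin of the second key: the same bin
p students[15818000 % 100]     # the first student's data is gone

# The sum of codepoints ignores character order, so anagrams always conflict
p "listen".codepoints.sum == "silent".codepoints.sum
```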

# Overview of Hash Functions

• Goals:
1. From a key, calculate an index that is evenly (randomly) distributed
(this is the reason for the word hash, as in hashed beef or hash browns)
2. Adjust the range of the result to the size of the hash table
• Steps:
1. Calculate a large integer (e.g. `int` in C) from the key
2. Adjust this large integer to the hash table size using a modulo operation

Goal/step 2 is easy. Therefore, we concentrate on goal/step 1.
(often step 1 alone is called 'hash function')
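
The two steps can be sketched as follows, using Ruby's built-in `String#hash` for step 1. The helper name `bin_index` is hypothetical.

```ruby
# Step 1: compute a large integer from the key
# Step 2: adjust it to the table size with a modulo operation
def bin_index(key, table_size)
  large = key.hash       # large integer; may be negative and differs between program runs
  large % table_size     # Ruby's % yields a result in 0...table_size for a positive divisor
end

index = bin_index("Hanako Aoyama", 100)
puts index
```

In C, the `%` operator can return a negative result for a negative left operand, which is one reason to compute hash values with unsigned integers there.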

# Hash Function Example 1

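```c
#include <stdint.h>

/* sdbm string hash; unsigned arithmetic avoids undefined behavior on overflow */
uint32_t sdbm_hash(const char key[])
{
    uint32_t hash = 0;
    while (*key) {
        /* the parentheses are required: in C, + and - bind tighter than << */
        hash = (unsigned char)*key++ + (hash << 6) + (hash << 16) - hash;
    }
    return hash;
}
```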
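
For experimenting with this function, here is a hedged Ruby port. Ruby integers do not overflow, so the sketch masks the result to 32 bits to imitate C unsigned arithmetic; the mask is an addition for the port. Each shift is parenthesized because `+` and `-` bind tighter than `<<` in Ruby as well as in C.

```ruby
# Ruby port of the sdbm string hash, kept to 32 bits with a mask
def sdbm_hash(key)
  hash = 0
  key.each_byte do |byte|
    # (hash << 6) + (hash << 16) - hash equals hash * 65599
    hash = (byte + (hash << 6) + (hash << 16) - hash) & 0xFFFFFFFF
  end
  hash
end

puts sdbm_hash("a")
puts sdbm_hash("ab")
```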

# Hash Function Example 2

(simplified from MurmurHash3; for 32-bit machines)

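```c
#include <stdint.h>

#define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))
/* Assumption: the slide only says "a constant"; these values are the ones
   used by the 32-bit variant of MurmurHash3 */
#define C1 0xcc9e2d51u
#define R1 15

/* unsigned 32-bit types keep the shifts and multiplications well defined */
uint32_t too_simple_hash(const uint32_t key[], int length)
{
    uint32_t h = 0;
    for (int i = 0; i < length; i++) {
        uint32_t k = key[i] * C1;  /* scramble one 32-bit word of the key */
        h ^= ROTL32(k, R1);        /* rotate, then mix into the running hash */
    }
    h ^= h >> 13;                  /* final mixing */
    h *= 0xc2b2ae35u;
    return h;
}
```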

Frequent operations in hash functions: Addition (`+`), multiplication (`*`), bitwise XOR (`^`), shift (`<<`, `>>`)

# Evaluation of Hash Functions

• Quality of distribution
• Execution speed
• Ease of implementation

# Precautions for Hash Functions

• Use all parts of the key
Counterexample: Using only characters 3 and 4 of a string → bad distribution
• Do not use data besides the key
If data attributes other than the key (e.g. price of a product, student's total marks) are used and later change, the hash value will change.
There will then be no way to find the data anymore.
• Collapse equivalent keys
Example 1: Strings: Upper/lower case letters
Example 2: Corners in the game of Go: top/bottom, left/right, diagonal, and black/white symmetries
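
As a small illustration of collapsing equivalent keys, the following sketch normalizes strings to lower case before hashing, so that keys differing only in case land in the same bin. The function name is hypothetical.

```ruby
# Hypothetical case-insensitive hash function: normalize first, then hash
def case_insensitive_hf(key, table_size)
  key.downcase.hash % table_size
end

puts case_insensitive_hf("Aoyama", 100) == case_insensitive_hf("AOYAMA", 100)
```

The key comparison during lookup has to use the same notion of equivalence (here, comparing the lower-cased strings).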

# Conflicts

• A conflict happens when hf(k1) = hf(k2) but k1 ≠ k2
• Conflicts happen quite easily
• This requires special treatment
• Main solutions:
• Chaining
• Open addressing

# Terms and Variables for Conflict Resolution

• Number of data items: n
• Fields in hash table: bins/buckets
• Number of bins: m
(equal to the range of values of the hash function after the modulo operation)
• Load factor (average number of data items per bin): α (= n/m)
• For a good (close to random) hash function, the variation in the number of data items for each bin is low
(Poisson distribution)
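
The following simulation sketches what this looks like in practice. A seeded random number generator stands in for an ideal hash function; the values n = 3000 and m = 1000 are assumptions for illustration. The last column is the Poisson prediction m·e^(−α)·α^k/k! for the number of bins holding exactly k items.

```ruby
n = 3000                    # number of data items (assumed)
m = 1000                    # number of bins (assumed)
alpha = n.to_f / m          # load factor
rng = Random.new(2019)      # fixed seed, so that runs are repeatable

bins = Array.new(m, 0)
n.times { bins[rng.rand(m)] += 1 }   # stand-in for an ideal hash function

observed = bins.tally                # items per bin => number of such bins
(0..observed.keys.max).each do |k|
  factorial = (1..k).reduce(1, :*)
  expected = m * Math.exp(-alpha) * alpha**k / factorial   # Poisson prediction
  puts format("%2d items: %4d bins (Poisson: %6.1f)", k, observed.fetch(k, 0), expected)
end
```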

# Chaining

• Store conflicting data items in a linked list
• Each bin in the hash table is the head of a linked list
• If the linked list is short, then search/insertion/deletion will be fast
• The average length of the linked list is equal to load factor α
• The load factor is usually greater than 1 (e.g. 3 ≤ α ≤ 6)
• All operations are carried out in three steps:
1. Use hf(k) mod m to find the bin
2. Use hf(k) (without modulo operation) to find a candidate entry in the linked list
3. Use the actual key k to confirm that we found the correct data item in the linked list
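
The three steps can be sketched as a small Ruby class. This is an illustrative sketch, not the lecture's own implementation: the class name `ChainedHash` and the `Entry` struct are hypothetical, the number of bins is fixed, and a Ruby `Array` stands in for each linked list.

```ruby
class ChainedHash
  Entry = Struct.new(:hash_value, :key, :value)

  def initialize(bins = 8)
    @bins = Array.new(bins) { [] }   # one (initially empty) chain per bin
  end

  def []=(key, value)
    entry = find_entry(key)
    if entry
      entry.value = value            # replacement
    else
      h = key.hash
      @bins[h % @bins.size] << Entry.new(h, key, value)
    end
  end

  def [](key)
    entry = find_entry(key)
    entry && entry.value
  end

  def delete(key)
    entry = find_entry(key)
    return nil unless entry
    @bins[entry.hash_value % @bins.size].delete_if { |e| e.equal?(entry) }
    entry.value
  end

  private

  def find_entry(key)
    h = key.hash
    chain = @bins[h % @bins.size]          # step 1: hf(k) mod m selects the bin
    chain.find do |e|
      e.hash_value == h &&                 # step 2: compare the full hash values
        e.key.eql?(key)                    # step 3: confirm with the actual key
    end
  end
end

students = ChainedHash.new
students[15818000] = "I.T. Aoyama"
students[15718000] = "K.S. Aoyama"
puts students[15818000]
puts students[15718000]
```

Storing the full hash value in each entry makes step 2 a cheap integer comparison, so the possibly expensive key comparison of step 3 is only needed for likely matches.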

# Implementation of Chaining

• Implementation in Ruby: Ahashdictionary.rb
• Uses `Array` in place of linked list
• Uses Ruby's `hash` function

# Open Addressing

• Store key and data in the hash table itself
• In case of conflict, successively check different bins
• For check number i, use hash function ohf(key, i)
• Linear probing: ohf(key, i) = (hf(key) + i) mod m
• Quadratic probing: ohf(key, i) = (hf(key) + c₁·i + c₂·i²) mod m
• Many other variations exist
• The load factor has to be between 0 and 1; ≤ 0.5 is reasonable
• Problem: Deletion is difficult
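
The sketch below illustrates linear probing and why deletion is difficult. Simply resetting a bin to `nil` would cut off the probe sequence of every key stored behind it, so the sketch marks deleted bins with a special tombstone object instead. The class name, the `DELETED` marker, and the fixed table size are assumptions of this sketch; it also does not support `nil` as a key and never expands.

```ruby
class LinearProbingHash
  DELETED = Object.new   # tombstone: the bin was used, but its entry was deleted

  def initialize(bins = 16)
    @keys = Array.new(bins)      # nil means "never used"
    @values = Array.new(bins)
  end

  def []=(key, value)
    slot = find_slot(key, for_insert: true)
    raise "table full" if slot.nil?
    @keys[slot] = key
    @values[slot] = value
  end

  def [](key)
    slot = find_slot(key, for_insert: false)
    slot && @values[slot]
  end

  def delete(key)
    slot = find_slot(key, for_insert: false)
    return nil if slot.nil?
    value = @values[slot]
    @keys[slot] = DELETED        # not nil: later keys must stay reachable
    @values[slot] = nil
    value
  end

  private

  # Checks the bins (hf(key) + i) mod m for i = 0, 1, 2, ...
  def find_slot(key, for_insert:)
    m = @keys.size
    first_free = nil
    m.times do |i|
      slot = (key.hash + i) % m
      k = @keys[slot]
      if k.nil?                            # end of the probe sequence
        return for_insert ? (first_free || slot) : nil
      elsif k.equal?(DELETED)
        first_free ||= slot                # reusable for insertion, keep probing
      elsif k.eql?(key)
        return slot
      end
    end
    for_insert ? first_free : nil
  end
end

table = LinearProbingHash.new
table["I.T."] = 15818000
table["K.S."] = 15718000
puts table["K.S."]
```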

# Time Complexity of Hashing

(average, for chaining)

• Calculation of hash function
• Dependent on key length
• O(1) if key length is constant or limited
• Search in bin
• O(1) if load factor is below a given constant
• O(n) in worst case, but this can be avoided by choice of hash function

# Expansion and Shrinking of Hash Table

• The efficiency of hashing depends on the load factor
• If the number of data items increases, the hash table has to be expanded
• If the number of data items decreases, it is desirable to shrink the hash table
• Expansion/shrinking can be implemented by re-inserting the data into a new hash table
(changing the divisor of the modulo operation)
• Expansion/shrinking is heavy (time: O(n))
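
Re-insertion into a new table can be sketched as follows, for a table that uses chaining and stores each bin as an array of `[key, value]` pairs. The function name `rehash` and this representation are assumptions of the sketch.

```ruby
# Build a new table with new_size bins and re-insert every entry: O(n)
def rehash(bins, new_size)
  new_bins = Array.new(new_size) { [] }
  bins.each do |chain|
    chain.each do |key, value|
      new_bins[key.hash % new_size] << [key, value]   # new divisor for the modulo
    end
  end
  new_bins
end

old_bins = Array.new(2) { [] }
%w[a b c d e].each_with_index { |key, i| old_bins[key.hash % 2] << [key, i] }
new_bins = rehash(old_bins, 4)
p new_bins.map(&:size)
```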

# Analysis of the Time Complexity of Expansion

• If the hash table is expanded for every data insertion, this is extremely inefficient
• Limit the number of expansions:
• Increase the size of the hash table whenever the number of data items doubles
• The time needed for the insertion of n data items (n = 2^x) is
2 + 4 + 8 + ... + n/2 + n < 2n = O(n)
• The time complexity per data item is O(n)/n = O(1)

(This is a simple example of amortized analysis.)
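
The bound can be checked with a short simulation. The sketch below counts only the re-insertions caused by doubling (a slightly different accounting from the sum above, which counts table sizes), starting from an assumed initial capacity of 1. The total stays below 2n, so the cost per inserted item stays constant.

```ruby
# Count the re-insertions caused by doubling while inserting n items
def reinsertions_for(n)
  capacity = 1
  size = 0
  reinsertions = 0
  n.times do
    if size == capacity          # table is full: double it, re-insert everything
      reinsertions += size
      capacity *= 2
    end
    size += 1
  end
  reinsertions
end

[8, 1024, 1_000_000].each do |n|
  total = reinsertions_for(n)
  puts "n = #{n}: #{total} re-insertions, #{total.to_f / n} per item"
end
```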

# Special Purpose Hash Functions

• Universal hashing
• Perfect hash function
• Cryptographic hash function

# Universal Hashing

• Denial-of-service attack:
• Attacker provides lots of data with same hash value
• Efficiency of hash degrades from O(1) to O(n)
• Solution: Use a random number to create a different hash function for each program execution
• Example: `ruby -e 'puts 123.hash, 123.hash'` prints different results on different invocations, but the same result twice within one invocation
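
The idea can be sketched with a classic textbook family of hash functions for integer keys, h(k) = ((a·k + b) mod p) mod m, where p is a prime larger than any key and a, b are chosen at random. This is an illustration only and makes no claim about how Ruby itself implements its randomization; the names `PRIME` and `make_hash_function` are hypothetical.

```ruby
PRIME = 2_147_483_647   # the prime 2**31 - 1; keys are assumed to be in 0...PRIME

# Returns a randomly chosen member of the family as a lambda
def make_hash_function(m, rng = Random.new)
  a = rng.rand(1...PRIME)   # 1 <= a < PRIME
  b = rng.rand(0...PRIME)   # 0 <= b < PRIME
  ->(k) { ((a * k + b) % PRIME) % m }
end

hf = make_hash_function(100)   # a different function on every program run
puts hf.call(15818000)
puts hf.call(15718000)
```

Because a and b are unknown to an attacker, a set of keys that all conflict under one member of the family is unlikely to conflict under the member chosen at run time.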

# Perfect Hash Function

• Custom-designed hash function without conflicts
• Useful when data is completely predefined
• In the best case, the hash table is completely filled
• Application: Keywords in programming languages
• Example implementation: GNU gperf
(used in Ruby character property lookups)
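
A tiny hand-made example, assuming a predefined set of five keywords: the hash function below (word length plus the first byte, modulo 6) was found by trial and happens to map each keyword to a different bin. The keyword set, the function, and all names are assumptions of this sketch; tools such as gperf search for such functions automatically.

```ruby
KEYWORDS = %w[if else end def while]
KEYWORD_BINS = 6   # found by trial; one bin stays empty

# Conflict-free for KEYWORDS only; other words may share a bin with a keyword
def keyword_hf(word)
  (word.length + word.getbyte(0)) % KEYWORD_BINS
end

KEYWORD_TABLE = Array.new(KEYWORD_BINS)
KEYWORDS.each { |word| KEYWORD_TABLE[keyword_hf(word)] = word }

# One hash computation and one string comparison, without any conflict handling
def keyword?(word)
  !word.empty? && KEYWORD_TABLE[keyword_hf(word)] == word
end

p KEYWORDS.map { |word| keyword_hf(word) }
p keyword?("while")
p keyword?("for")
```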

# Cryptographic Hash Function

• Used for electronic signatures, ...
• Differences from general hash functions:
• Output usually longer (e.g. 128/256/384/512/... bits)
• Practically impossible to find two different inputs that produce the same output
• Much more difficult to invert (find k from hf(k))
• Evaluation may take longer
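
As an illustration, Ruby's standard `digest` library provides several cryptographic hash functions. The sketch below uses SHA-256 (chosen here as an example with 256-bit output) on two inputs that differ in a single character.

```ruby
require 'digest'

a = Digest::SHA256.hexdigest("I.T. Aoyama")
b = Digest::SHA256.hexdigest("I.T. Aoyamb")   # differs in the last character only
puts a
puts b
puts a.length * 4   # 64 hexadecimal digits, 4 bits each
```

The two outputs have the same length but look unrelated, although the inputs are almost identical.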

# Evaluation of Hashing

• Search/insertion/deletion are possible in (average) constant time
• Reasonably good actual performance
• No need for keys to be numeric or ordered
• Wide field of application

Problems:

• Sorting needs to be done separately
(Ruby `Hash`es store insertion order, but not key order)
• Proximity/similarity search is impossible
• Expansion/shrinking takes time (and may interrupt normal operation)
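
A short illustration of the sorting point in Ruby, using a small sample hash made up for this sketch: iteration follows insertion order, and obtaining key order requires an explicit O(n log n) sort.

```ruby
months = { 'March' => 31, 'January' => 31, 'February' => 28 }

p months.keys             # insertion order
p months.sort.to_h.keys   # sorted by key; the sort has to be requested explicitly
```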

# Comparison of Dictionary Implementations

| Implementation | Search | Insertion | Deletion | Sorting |
|---|---|---|---|---|
| Sorted array | O(log n) | O(n) | O(n) | O(n) |
| Unordered array/linked list | O(n) | O(1) | O(n) | O(n log n) |
| Balanced tree | O(log n) | O(log n) | O(log n) | O(n) |
| Hash table | O(1) | O(1) | O(1) | O(n log n) |

# The Ruby `Hash` Class

(Perl: `hash`; Java: `HashMap`; Python: `dict`)

• Because dictionary ADTs are often implemented using hashing,
in many programming languages, dictionaries are also called "hash"
• Creation: `my_hash = {}` or `my_hash = Hash.new`
• Initialization: `months = {'January' => 31, 'February' => 28, 'March' => 31, ... }`
• Insertion/replacement: `months['February'] = 29`
• Lookup: `this_month_length = months[this_month]`
• `Hash` in Ruby has more functionality than in other programming languages (presentation)
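
A few of these additional features, as a brief sketch using standard `Hash` methods (the sample data is made up for illustration):

```ruby
counts = Hash.new(0)                          # default value for missing keys
"hash table".each_char { |c| counts[c] += 1 }
p counts['a']                                 # number of occurrences of 'a'
p counts['z']                                 # missing key: the default, not nil

months = { 'January' => 31, 'February' => 28, 'March' => 31 }
p months.key?('April')                        # membership test
p months.fetch('April', 30)                   # lookup with a fallback value
p months.select { |_, days| days == 31 }.keys # filtering returns a new Hash
```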

# Implementation of Hashing in Ruby

• Source: st.c
• Used chaining until 2016
(originally by Peter Moore, University of California Berkeley (1989))
• On Nov. 7, 2016 replaced by open addressing (by Vladimir Makarov, with help from Yura Sokolov)
• Reason: Faster because open addressing works better with modern cache hierarchy
• Used inside Ruby, too:
• Lookup of global identifiers such as class names
• Lookup of methods for each class
• Lookup of instance variables for each object

# Summary

• A hash table implements a dictionary ADT using a hash function
• Main design points:
• Selection of hash function
• Conflict resolution methods (chaining or open addressing)
• Reasonably good actual performance
• Wide field of application

# Glossary

hashing, scatter storage technique
ハッシュ法、挽き混ぜ法
hash function
ハッシュ関数
hash table
ハッシュ表
game of Go

joseki

conflict

Poisson distribution
ポアソン分布
chaining
チェイン法、連鎖法

linear probing

divisor
(割り算の) 法
amortized analysis

universal hashing

perfect hash function

denial of service attack
DOS 攻撃、サービス拒否攻撃
cryptographic hash function

electronic signature

proximity search

similarity search