Hash Functions and Hash Tables

http://www.sw.it.aoyama.ac.jp/2016/DA/lecture10.html

© 2009-16 Martin J. Dürst 青山学院大学

- Summary of last lecture
- Additional speedup for dictionary
- Overview of hashing
- Hash functions
- Conflict resolution
- Evaluation of hashing
- Hashes in Ruby
- Summary

- Balanced trees keep search/insertion/deletion in a dictionary ADT at `O`(log `n`) worst-case time
- 2-3-4 trees and B(+)trees increase the degree of the nodes beyond that of a binary tree, but keep the tree height uniform (all leaves are at the same depth)
- Red-black trees and AVL trees limit the variation of the tree height
- B-trees and B+ trees are very useful for file systems and databases on secondary storage

Implementation | Search | Insertion | Deletion |
---|---|---|---|
Sorted array | O(log n) | O(n) | O(n) |
Unordered array/linked list | O(n) | O(1) | O(n) |
Balanced tree | O(log n) | O(log n) | O(log n) |

- Use an array with an element for each key value
- Search: `value = array[key]`, time: `O`(1)
- Insertion/replacement: `array[key] = value`, time: `O`(1)
- Deletion: `array[key] = nil`, time: `O`(1)
- Example:

`students = []`

`students[15815000] = "I.T. Aoyama"`
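As a rough illustration of the cost of direct addressing, the following Ruby sketch repeats the example above and then checks how many slots the array occupies. The variable names `name` and `slots` are introduced only for this sketch.

```ruby
# Direct addressing: the key itself is used as the array index.
students = []
students[15815000] = "I.T. Aoyama"

# Search and deletion are single array accesses.
name = students[15815000]   # search
students[15815000] = nil    # deletion

# Ruby extended the array up to the largest index used, so this
# single entry made the array more than 15 million slots long.
slots = students.size
puts slots
```

A string key such as `"Hanako Aoyama"` cannot be used as an array index at all, which is the second half of the problem.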

Problem: Array size, non-numeric keys

Solution: Transform key with hash function

(also called scatter storage technique)

- Transform the key `k` to a compact space using the *hash function* `hf`
- Use `hf`(`k`) instead of `k` in the same way as direct addressing
- The array is called *hash table*
- The hash function is evaluated in `O`(1) time
- Search: `value = table[hf(key)]`, time: `O`(1)
- Insertion/replacement: `table[hf(key)] = value`, time: `O`(1)
- Deletion: `table[hf(key)] = nil`, time: `O`(1)

- Choice/design of hash function
Example 1:

`def hf(k); k % 100; end`

`students[15815000 % 100] = "I.T. Aoyama"`

Example 2:

`def hf(k); k.codepoints.sum; end`

`students["Hanako Aoyama".codepoints.sum] = ...`

- Resolution of conflicts
What happens with the following:

`students[15815000 % 100] = "I.T. Aoyama"`

`students[15715000 % 100] = "K.S. Aoyama"`
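To make the question above concrete, here is a minimal Ruby sketch using the hash function from Example 1. The names `first_index`, `second_index`, and `stored` are introduced only for illustration.

```ruby
# Example 1 from above: keep only the last two digits of the key.
def hf(k)
  k % 100
end

students = []
students[hf(15815000)] = "I.T. Aoyama"
students[hf(15715000)] = "K.S. Aoyama"   # same index: overwrites the first entry

first_index  = hf(15815000)
second_index = hf(15715000)
stored = students[first_index]

puts first_index, second_index, stored
```

Both keys end in `00`, so both map to the same index and the first student is silently lost. Avoiding this loss is the purpose of conflict resolution.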

- Goals:
  - From a key, calculate an index that is as smoothly distributed as possible
    (this is the reason for the word *hash*, as in hashed beef or hash browns)
  - Adjust the range of the result to the hash table size
- Steps:
  1. Calculate a large integer (e.g. `int` in C) from the key
  2. Adjust this large integer to the hash table size using a modulo operation

Step 2 is easy. Therefore, we concentrate on step 1.

(often step 1 alone is called 'hash function')

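    unsigned int sdbm_hash(const char key[])
    {
        /* unsigned: overflow wraps around instead of being undefined */
        unsigned int hash = 0;

        while (*key) {
            /* parentheses are required: in C, + binds tighter than << */
            hash = (unsigned char)*key++ + (hash << 6) + (hash << 16) - hash;
        }
        return hash;
    }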
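The two steps (large integer, then modulo) can be sketched in Ruby as follows. This is a port of the sdbm-style function above for illustration only. `TABLE_SIZE`, `MASK32`, and `bin_index` are names assumed for this sketch, and the mask imitates the 32-bit wrap-around of a C `unsigned int`.

```ruby
TABLE_SIZE = 101         # hypothetical table size, chosen only for this example
MASK32 = 0xFFFFFFFF      # keeps intermediate results within 32 bits

# Step 1: calculate a large integer from the key (sdbm-style mixing).
def sdbm_hash(key)
  hash = 0
  key.each_byte do |byte|
    hash = (byte + (hash << 6) + (hash << 16) - hash) & MASK32
  end
  hash
end

# Step 2: adjust the large integer to the table size with a modulo operation.
def bin_index(key)
  sdbm_hash(key) % TABLE_SIZE
end

puts sdbm_hash("Hanako Aoyama")
puts bin_index("Hanako Aoyama")
```

Only step 2 depends on the table size, which is why step 1 alone is often called the hash function.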

(simplified from MurmurHash3; for 32 bit machines)

    #include <stdint.h>

    #define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))

    /* C1 and R1 are constants that have to be defined before use */
    uint32_t too_simple_hash(const uint32_t key[], int length)
    {
        uint32_t h = 0;

        for (int i=0; i<length; i++) {
            uint32_t k = key[i] * C1;   // C1 is a constant
            h ^= ROTL32(k, R1);         // R1 is a constant
        }
        h ^= h >> 13;
        h *= 0xc2b2ae35;
        return h;
    }

- Quality of distribution
- Execution speed
- Ease of implementation

- Use all parts of the key
  Counterexample: Using only characters 3 and 4 of a string → bad distribution
- Do not use data besides the key
  If some data attributes (e.g. price of a product, student's total marks) change, the hash value will change and the data will not be found anymore
- Collapse equivalent keys
  Examples: Upper/lower case letters, top/bottom/left/right/black/white symmetries for the game of Go
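As a small illustration of collapsing equivalent keys, the following Ruby sketch normalizes the key before hashing. It reuses the code-point sum from Example 2. The name `case_insensitive_hf` is an assumption made for this sketch.

```ruby
# Hash function from Example 2: upper and lower case give different results.
def hf(k)
  k.codepoints.sum
end

# Variant that treats upper and lower case as equivalent:
# the key is normalized before the hash value is calculated.
def case_insensitive_hf(k)
  hf(k.downcase)
end

puts hf("Hanako Aoyama") == hf("HANAKO AOYAMA")
puts case_insensitive_hf("Hanako Aoyama") == case_insensitive_hf("HANAKO AOYAMA")
```

The key comparison used when searching inside a bin has to apply the same normalization, otherwise equivalent keys land in the same bin but are still not found.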

- Universal hashing
- Include a random number to create a different hash function for each program execution
- Solution for some denial-of-service attacks

(reference)

- Perfect hash function
- Custom-designed hash function without conflicts
- Useful when data is completely predefined
- In the best case, the hash table is completely filled
- Application: Keywords in programming languages
- Example implementation: GNU gperf (in Japanese)
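The idea of universal hashing mentioned above can be sketched for integer keys with the classic family h(k) = ((a·k + b) mod p) mod m, where a and b are chosen at random at startup. This is only one possible construction, not necessarily the one any particular language uses. The class name `RandomizedHash` and the choice of prime `P` are assumptions of this sketch.

```ruby
# A prime larger than any key used here (assumption of this sketch).
P = 2**61 - 1

class RandomizedHash
  # bins: number of bins m; rng: source of randomness (new for each program run)
  def initialize(bins, rng = Random.new)
    @bins = bins
    @a = rng.rand(1...P)   # 1 <= a < p
    @b = rng.rand(0...P)   # 0 <= b < p
  end

  # h(k) = ((a*k + b) mod p) mod m
  def hf(k)
    ((@a * k + @b) % P) % @bins
  end
end

h = RandomizedHash.new(100)
puts h.hf(15815000)
puts h.hf(15715000)
```

Because `a` and `b` differ from run to run, an attacker cannot prepare in advance a set of keys that all fall into the same bin.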

- Cryptographic hash functions are used for *electronic signatures*, ...
- Differences from general hash functions:
  - Practically impossible to generate same output from different input
  - Output usually longer (e.g. 128/256/384/512/... bits)
  - Evaluation may take longer

- A conflict happens when $\mathit{hf}(k_1) = \mathit{hf}(k_2)$ even though $k_1 \neq k_2$
- This requires special treatment
- Main solutions:
  - Chaining
  - Open addressing

- Fields in hash table: bins/buckets
- Number of bins: `m` (equal to the range of values of the hash function after the modulo operation)
- Number of data items: `n`
- Load factor (average number of data items per bin): `α` (`α` = `n`/`m`)
- For a good (close to random) hash function, the variation in the number of data items for each bin is low (Poisson distribution)
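The terms above can be explored with a small simulation. In this Ruby sketch, random numbers stand in for the output of a good hash function. The values of `m` and `n` and the seed are arbitrary choices for illustration.

```ruby
m = 100                  # number of bins
n = 300                  # number of data items
alpha = n.to_f / m       # load factor

rng = Random.new(2016)   # fixed seed so that the run is repeatable
bins = Array.new(m, 0)

# Random 32-bit numbers play the role of hash values (step 1);
# the modulo operation maps them to a bin (step 2).
n.times { bins[rng.rand(2**32) % m] += 1 }

# histogram[c] = number of bins that hold exactly c data items
histogram = Hash.new(0)
bins.each { |count| histogram[count] += 1 }

puts "load factor: #{alpha}"
histogram.sort.each { |count, number_of_bins| puts "#{count} items: #{number_of_bins} bins" }
```

The bins do not all hold exactly `α` items, but very long bins should be rare, which is what the Poisson distribution predicts.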

- Chaining stores conflicting data items in a linked list
- Each bin in the hash table is the head of a linked list
- If the linked list is short, then search/insertion/deletion will be fast
- The average length of the linked list is equal to the load factor `α`
- The load factor is usually greater than 1 (e.g. 3 ≤ `α` ≤ 6)
- All operations are carried out in two steps:
  1. Use `hf`(`k`) to determine the bin
  2. Use the key to find the data item in the linked list
- Implementation in Ruby: Ahashdictionary.rb
  - Uses `Array` in place of linked list
  - Uses Ruby's hash function
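The two-step procedure for chaining can be sketched as follows. This is a minimal illustration and not the contents of Ahashdictionary.rb. Like that file, it uses an `Array` per bin and Ruby's built-in `hash` method, but the class and method names are assumptions, and resizing is left out.

```ruby
class ChainingDictionary
  def initialize(bins = 8)
    # each bin is an Array of [key, value] pairs
    @table = Array.new(bins) { [] }
  end

  def insert(key, value)
    bin = bin_for(key)                          # step 1: determine the bin
    pair = bin.find { |k, _| k == key }         # step 2: look for the key in the bin
    if pair
      pair[1] = value                           # replacement
    else
      bin << [key, value]                       # insertion
    end
  end

  def search(key)
    pair = bin_for(key).find { |k, _| k == key }
    pair && pair[1]
  end

  def delete(key)
    bin_for(key).reject! { |k, _| k == key }
  end

  private

  # Ruby's % returns a non-negative result for a positive divisor,
  # even when key.hash is negative.
  def bin_for(key)
    @table[key.hash % @table.size]
  end
end

students = ChainingDictionary.new
students.insert(15815000, "I.T. Aoyama")
students.insert(15715000, "K.S. Aoyama")
puts students.search(15815000)
```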

- Open addressing stores key and data in the hash table itself
- In case of conflict, successively check different bins
- For check number `i`, use hash function $\mathit{ohf}(\mathit{key}, i)$
  - Linear probing: $\mathit{ohf}(\mathit{key}, i) = \mathit{hf}(\mathit{key}) + i$
  - Quadratic probing: $\mathit{ohf}(\mathit{key}, i) = \mathit{hf}(\mathit{key}) + c_1 i + c_2 i^2$
  - Many other variations exist
- The load factor has to be between 0 and 1; ≤ 0.5 is reasonable
- Problem: Deletion is difficult
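The following Ruby sketch shows linear probing and one common way of dealing with deletion, namely marking deleted bins with a special "tombstone" value so that later probes do not stop too early. The class name, the `DELETED` marker, and the fixed table size are assumptions of this sketch. `nil` cannot be used as a key here because it marks empty bins.

```ruby
class LinearProbingTable
  DELETED = Object.new   # tombstone: bin was used, but its entry was deleted

  def initialize(bins = 16)
    @keys = Array.new(bins)     # nil means "never used"
    @values = Array.new(bins)
  end

  # Linear probing: ohf(key, i) = (hf(key) + i) mod m
  def ohf(key, i)
    (key.hash + i) % @keys.size
  end

  def insert(key, value)
    free = nil
    @keys.size.times do |i|
      bin = ohf(key, i)
      if @keys[bin].nil?
        free ||= bin
        break                          # key cannot be stored beyond an empty bin
      elsif @keys[bin].equal?(DELETED)
        free ||= bin                   # remember the first reusable bin, keep looking
      elsif @keys[bin] == key
        @values[bin] = value           # replacement
        return
      end
    end
    raise "hash table is full" if free.nil?
    @keys[free] = key
    @values[free] = value
  end

  def search(key)
    bin = locate(key)
    bin && @values[bin]
  end

  def delete(key)
    bin = locate(key)
    return unless bin
    @keys[bin] = DELETED               # setting nil would cut the probe sequence
    @values[bin] = nil
  end

  private

  def locate(key)
    @keys.size.times do |i|
      bin = ohf(key, i)
      return nil if @keys[bin].nil?    # empty bin: key is not stored
      return bin if @keys[bin] == key  # tombstones never equal a key
    end
    nil
  end
end
```

If `delete` simply reset the bin to `nil`, any key that had been placed further along the same probe sequence would become unreachable, which is why deletion is harder than with chaining.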

(average, for chaining)

- Calculation of hash function
  - Dependent on key length
  - `O`(1) if key length is constant or limited
- Search in bin
  - Dependent on load factor
  - `O`(1) if load factor is below a given constant
  - `O`(`n`) in worst case, but this can be avoided by choice of hash function

- The efficiency of hashing depends on the load factor
- If the number of data items increases, the hash table has to be expanded
- If the number of data items decreases, it is desirable to shrink the hash table
- Expansion/shrinking can be implemented by re-inserting the data into a new hash table (changing the divisor of the modulo operation)
- Expansion/shrinking is heavy (time: `O`(`n`))

- If the hash table is expanded for each data insertion, this is extremely inefficient
- Limit the number of expansions:
  - Increase the size of the hash table whenever the number of data items doubles
  - The time needed for the insertion of `n` data items ($n = 2^x$) is $2 + 4 + 8 + \dots + n/2 + n < 2n = O(n)$
  - The time complexity per data item is `O`(`n`)/`n` = `O`(1)

(simple example of *amortized analysis*)
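The bound can be checked with a small counting experiment. This Ruby sketch does not build a real hash table. It only counts how many data items would have to be re-inserted if the capacity is doubled each time the table is full. Starting with capacity 1 is an assumption, and the sketch counts moved items rather than new table sizes, so the total differs slightly from the sum above but stays below the same 2`n` limit.

```ruby
# Counts the re-insertions caused by doubling while inserting n items.
def reinsertions_for(n)
  capacity = 1
  items = 0
  moved = 0
  n.times do
    if items == capacity
      moved += items     # every stored item is re-inserted into the new table
      capacity *= 2
    end
    items += 1
  end
  moved
end

n = 1024
puts reinsertions_for(n)          # total extra work for all insertions
puts reinsertions_for(n).to_f / n # extra work per inserted item
```

For `n` = 1024 the expansions move 1 + 2 + 4 + ... + 512 = 1023 items in total, which is less than one re-insertion per inserted item on average.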

Advantages:

- Search/insertion/deletion are possible in (average) **constant** time
- Reasonably good actual performance
- No need for keys to be ordered
- Wide field of application

Problems:

- Sorting needs to be done separately
- Proximity/similarity search is impossible
- Expansion/shrinking requires time (operations may be interrupted while the table is rebuilt)

Implementation | Search | Insertion | Deletion | Sorting |
---|---|---|---|---|
Sorted array | O(log n) | O(n) | O(n) | O(n) |
Unordered array/linked list | O(n) | O(1) | O(n) | O(n log n) |
Balanced tree | O(log n) | O(log n) | O(log n) | O(n) |
Hash table | O(1) | O(1) | O(1) | O(n log n) |

`Hash` class (Perl: `hash`; Java: `HashMap`; Python: `dict`)

- Because dictionary ADTs are often implemented using hashing, in many programming languages, dictionaries are called hashes
- Creation: `my_hash = {}` or `my_hash = Hash.new`
- Initialization: `months = {'January' => 31, 'February' => 28, 'March' => 31, ... }`
- Insertion/replacement: `months['February'] = 29`
- Lookup: `this_month_length = months[this_month]`

`Hash` in Ruby has more functionality than in other programming languages (presentation)
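A few of the additional operations that Ruby's `Hash` offers are shown in the following sketch. The key `'Smarch'` and the variable names are made up for illustration. The methods used (`fetch`, `delete`, `Hash.new` with a default value, `select`, `keys`) are part of Ruby's standard `Hash` class.

```ruby
months = { 'January' => 31, 'February' => 28, 'March' => 31 }

months['February'] = 29                  # insertion/replacement
march    = months['March']               # lookup
missing  = months['Smarch']              # unknown key: result is nil
fallback = months.fetch('Smarch', 30)    # lookup with a fallback value
months.delete('January')                 # deletion

# A hash with a default value, used here to count words
counts = Hash.new(0)
%w[hash table hash].each { |word| counts[word] += 1 }

# Selecting entries by value
long_months = months.select { |_, days| days == 31 }.keys

p march, missing, fallback, counts, long_months
```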

- Source: st.c
- Used chaining until very recently (originally by Peter Moore, University of California Berkeley (1989))
- On Nov. 7, 2016 replaced by open addressing (by Vladimir Makarov, with help from Yura Sokolov)
- Reason: Faster because open addressing works better with the cache hierarchy
- Used inside Ruby, too:
- Lookup of global identifiers such as class names
- Lookup of methods for each class
- Lookup of instance variables for each object

- A *hash table* implements a dictionary ADT using a *hash function*
- Main points:
  - Selection of hash function
  - Conflict resolution methods (*chaining* or *open addressing*)
- Reasonably good actual performance
- Wide field of application

- direct addressing
- 直接アドレス表
- hashing, scatter storage technique
- ハッシュ法、挽き混ぜ法
- hash function
- ハッシュ関数
- hash table
- ハッシュ表
- joseki
- 定石 (囲碁)
- universal hashing
- 万能ハッシュ法
- denial of service attack
- DOS 攻撃、サービス拒否攻撃
- perfect hash function
- 完全ハッシュ関数
- cryptographic hash function
- 暗号技術的ハッシュ関数
- electronic signatures
- 電子署名
- conflict
- 激突
- Poisson distribution
- ポアソン分布
- chaining
- チェイン法、連鎖法
- open addressing
- 開番地法、オープン法
- load factor
- 占有率
- linear probing
- 線形探査法
- quadratic probing
- 二次関数探査法
- divisor
- (割り算の) 法
- amortized analysis
- 償却分析
- proximity search
- 近接探索
- similarity search
- 類似探索