Hash Functions and Hash Tables

http://www.sw.it.aoyama.ac.jp/2018/DA/lecture10.html

© 2009-18 Martin J. Dürst 青山学院大学

- Leftovers and summary of last lecture
- Additional speedup for dictionary
- Overview of hashing
- Hash functions
- Conflict resolution
- Evaluation of hashing
- Hashes in Ruby
- Summary

- Balanced trees keep search/insertion/deletion in a dictionary ADT at `O`(log `n`) worst-case time
- 2-3-4 trees and B(+) trees increase the degree of a binary tree, but keep the height uniform (all leaves are at the same depth)
- Red-black trees and AVL trees impose limitations on the variation of the tree height
- B-trees and B+ trees are very useful for file systems and databases on secondary storage

Implementation | Search | Insertion | Deletion |
---|---|---|---|
Sorted array | O(log n) | O(n) | O(n) |
Unordered array/linked list | O(n) | O(1) | O(n) |
Balanced tree | O(log n) | O(log n) | O(log n) |

- Use an array with an element for each key value
- Search: `value = array[key]`, time: `O`(1)
- Insertion/replacement: `array[key] = value`, time: `O`(1)
- Deletion: `array[key] = nil`, time: `O`(1)
- Example:

`students = []`

`students[15817000] = "I.T. Aoyama"`

Problem: Array size, non-numeric keys
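Both problems can be observed directly with Ruby's built-in `Array`. The following is only an illustrative sketch; the second name is a made-up example value.

```ruby
# Direct addressing with a plain Ruby Array (illustrative sketch).
students = []
students[15817000] = "I.T. Aoyama"

# Problem 1: the array grows to cover every index up to the key.
puts students.size           # 15817001 slots ...
puts students.compact.size   # ... for 1 stored entry

# Problem 2: a non-numeric key cannot be used as an array index.
begin
  students["Hanako Aoyama"] = "H. Aoyama"   # hypothetical entry
rescue TypeError => e
  puts "TypeError: #{e.message}"
end
```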

Solution: Transform key with hash function

(also called *scatter storage technique*)

- Transform the key `k` to a compact space using the *hash function* `hf`
- Use `hf`(`k`) instead of `k` in the same way as direct addressing
- The data is contained in the *hash table* (`table` below)
- The hash function can be evaluated in constant time (`O`(1))
- Search: `value = table[hf(key)]`, time: `O`(1)
- Insertion/replacement: `table[hf(key)] = value`, time: `O`(1)
- Deletion: `table[hf(key)] = nil`, time: `O`(1)

- Choice/design of hash function
Example 1: remainder:

`def hf(k); k % 100; end`

`students[15815000 % 100] = "I.T.Aoyama"`

Example 2: sum of codepoints:

`def hf(k); k.codepoints.sum; end`

`students["HanakoAoyama".codepoints.sum] = ...`

- Resolution of conflicts
What happens with the following:

`students[15815000 % 100] = "I.T.Aoyama"`

`students[15715000 % 100] = "K.S.Aoyama"`
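The question above can be answered by running the two assignments. This short sketch uses the remainder hash function from Example 1; both student numbers end in `00`, so they map to the same index.

```ruby
# Remainder hash function from Example 1.
def hf(k)
  k % 100
end

students = []
students[hf(15815000)] = "I.T.Aoyama"
students[hf(15715000)] = "K.S.Aoyama"   # same index as the line above

puts hf(15815000)             # 0
puts hf(15715000)             # 0
puts students[hf(15815000)]   # prints K.S.Aoyama: the first entry was overwritten
```

Without conflict resolution, the second insertion silently replaces the first, and a search for the first key returns the wrong data.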

- Goals:
  - From a key, calculate an index that is smoothly (randomly) distributed
    (this is the reason for the word `hash`, as in hashed beef or hash browns)
  - Adjust the range of the result to the size of the hash table
- Steps:
  1. Calculate a large integer (e.g. `int` in C) from the key
  2. Adjust this large integer to the hash table size using a modulo operation

Step 2 is easy. Therefore, we concentrate on step 1.

(often step 1 alone is called 'hash function')

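    unsigned int sdbm_hash(const char key[])
    {
        unsigned int hash = 0;   /* unsigned: overflow wraps around */
        while (*key) {
            /* the shifts need parentheses: + and - bind tighter than << */
            hash = (unsigned char)*key++ + (hash << 6) + (hash << 16) - hash;
        }
        return hash;
    }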

(simplified from MurmurHash3; for 32-bit machines)

    #define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))

    /* unsigned types keep the rotation and the shifts well-defined */
    unsigned int too_simple_hash(const unsigned int key[], int length)
    {
        unsigned int h = 0;
        for (int i = 0; i < length; i++) {
            unsigned int k = key[i] * C1;  // C1 is a constant
            h ^= ROTL32(k, R1);            // R1 is a constant
        }
        h ^= h >> 13;
        h *= 0xc2b2ae35;
        return h;
    }

Frequent operations in hash functions: Addition (`+`), multiplication (`*`), bitwise XOR (`^`), shift (`<<`, `>>`)
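The two steps can also be sketched in Ruby. This is a port of the sdbm-style function under one stated assumption: Ruby integers never overflow, so the result is masked to 32 bits to imitate C's fixed-width arithmetic. The names `MASK32` and `bin_index` are illustrative and not part of the lecture.

```ruby
MASK32 = 0xFFFFFFFF   # assumption: emulate a 32-bit unsigned int

# Step 1: calculate a large integer from the key (sdbm-style).
def sdbm_hash(key)
  hash = 0
  key.each_byte do |byte|
    hash = (byte + (hash << 6) + (hash << 16) - hash) & MASK32
  end
  hash
end

# Step 2: adjust the large integer to the hash table size.
def bin_index(key, table_size)
  sdbm_hash(key) % table_size
end

puts sdbm_hash("ab")        # 6363201
puts bin_index("ab", 100)   # 1
```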

- Quality of distribution
- Execution speed
- Ease of implementation

- Use all parts of the key
  Counterexample: Using only characters 3 and 4 of a string → bad distribution
- Do not use data besides the key
  If some data attributes (e.g. price of a product, student's total marks) change, the hash value will change and the data will not be found anymore
- Collapse equivalent keys
  Examples: For text: upper/lower case letters. For the game of Go: top/bottom, left/right, diagonal, and black/white symmetries
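Collapsing equivalent keys can be sketched as a normalization step in front of the hash function. The helper `canonical_key` is hypothetical, and the hash itself is the sum-of-codepoints function from Example 2, used here only because it is short.

```ruby
# Hypothetical helper: map equivalent keys to one canonical form.
def canonical_key(name)
  name.strip.downcase
end

# Sum of codepoints (Example 2), applied to the canonical form.
def name_hash(name)
  canonical_key(name).codepoints.sum
end

puts name_hash("Hanako Aoyama") == name_hash("HANAKO AOYAMA")   # true
puts name_hash("ab") == name_hash("ba")                         # true
```

The second line shows a weakness of the sum of codepoints: it ignores character order, so all anagrams conflict.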

- A conflict happens when $hf(k_1) = hf(k_2)$ but $k_1 \neq k_2$
- Conflicts happen quite easily
- This requires special treatment
- Main solutions:
  - Chaining
  - Open addressing

- Number of data items: `n`
- Fields in hash table: bins/buckets
- Number of bins: `m` (equal to the range of values of the hash function after the modulo operation)
- Load factor (average number of data items per bin): `α` (= `n`/`m`)
- For a good (close to random) hash function, the variation in the number of data items for each bin is low (Poisson distribution)

- Store conflicting data items in a linked list
- Each bin in the hash table is the head of a linked list
- If the linked list is short, then search/insertion/deletion will be fast
- The average length of the linked list is equal to load factor α
- The load factor is usually greater than 1 (e.g. 3≦α≦6)
- All operations are carried out in three steps:
  1. Use `hf`(`k`) mod `m` to find the bin
  2. Use `hf`(`k`) (without modulo operation) to find a candidate entry in the linked list
  3. Use the actual key `k` to confirm that we found the correct data item in the linked list

- Implementation in Ruby: Ahashdictionary.rb
  - Uses `Array` in place of linked list
  - Uses Ruby's `hash` function
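The three steps of chaining can be sketched as follows. This is not the content of Ahashdictionary.rb, only a minimal illustration under the same two choices (an `Array` per bin, Ruby's `hash` method). The class name `ChainedHashSketch` and the `Entry` struct are made up for this example.

```ruby
class ChainedHashSketch
  # Each entry keeps the full hash value so step 2 can compare it cheaply.
  Entry = Struct.new(:hash_value, :key, :value)

  def initialize(bin_count = 8)
    @bins = Array.new(bin_count) { [] }
  end

  def []=(key, value)
    entry = find_entry(key)
    if entry
      entry.value = value                              # replacement
    else
      bin_for(key) << Entry.new(key.hash, key, value)  # insertion
    end
  end

  def [](key)
    entry = find_entry(key)
    entry && entry.value
  end

  def delete(key)
    entry = find_entry(key)
    return nil unless entry
    bin_for(key).delete_if { |e| e.equal?(entry) }
    entry.value
  end

  private

  # Step 1: hf(k) mod m selects the bin.
  def bin_for(key)
    @bins[key.hash % @bins.size]
  end

  # Step 2: compare full hash values; step 3: confirm with the actual key.
  def find_entry(key)
    h = key.hash
    bin_for(key).find { |e| e.hash_value == h && e.key.eql?(key) }
  end
end

dict = ChainedHashSketch.new(2)   # only 2 bins, so conflicts are certain
dict["Ruby"] = 1995
dict["Perl"] = 1987
dict["Java"] = 1995
puts dict["Perl"]   # 1987
```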

- Store key and data in hash table itself
- In case of conflict, successively check different bins
- For check number `i`, use hash function `ohf`(`key`, `i`)
  - Linear probing: $\mathit{ohf}(\mathit{key}, i) = \mathit{hf}(\mathit{key}) + i$
  - Quadratic probing: $\mathit{ohf}(\mathit{key}, i) = \mathit{hf}(\mathit{key}) + c_1 i + c_2 i^2$
  - Many other variations exist
- The load factor has to be between 0 and 1; ≦0.5 is reasonable
- Problem: Deletion is difficult
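A minimal sketch of linear probing shows why deletion is difficult: simply emptying a bin would cut the probe sequence of every key stored behind it. One common workaround, assumed here, is to leave a marker (often called a tombstone) in the deleted bin. The class and constant names are illustrative, the probe index is reduced modulo the table size, the table never grows, and `nil` cannot be used as a key.

```ruby
class LinearProbingSketch
  TOMBSTONE = Object.new   # marks a deleted bin so probe sequences stay intact

  def initialize(bin_count = 16)
    @keys = Array.new(bin_count)     # nil means "never used"
    @values = Array.new(bin_count)
  end

  def []=(key, value)
    slot = find_slot(key, for_insert: true)
    raise "table full" if slot.nil?
    @keys[slot] = key
    @values[slot] = value
  end

  def [](key)
    slot = find_slot(key, for_insert: false)
    slot && @values[slot]
  end

  def delete(key)
    slot = find_slot(key, for_insert: false)
    return nil if slot.nil?
    value = @values[slot]
    @keys[slot] = TOMBSTONE
    @values[slot] = nil
    value
  end

  private

  # Check bins (hf(key) + i) mod m for i = 0, 1, 2, ...
  def find_slot(key, for_insert:)
    m = @keys.size
    first_free = nil
    m.times do |i|
      slot = (key.hash + i) % m
      current = @keys[slot]
      if current.nil?                    # end of the probe sequence
        return for_insert ? (first_free || slot) : nil
      elsif current.equal?(TOMBSTONE)    # deleted: remember it, keep probing
        first_free ||= slot
      elsif current.eql?(key)
        return slot
      end
    end
    for_insert ? first_free : nil
  end
end

table = LinearProbingSketch.new(8)
table["I.T.Aoyama"] = 15815000
table["K.S.Aoyama"] = 15715000
table.delete("I.T.Aoyama")
puts table["K.S.Aoyama"]   # 15715000
```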

(average, for chaining)

- Calculation of hash function
  - Dependent on key length
  - `O`(1) if key length is constant or limited
- Search in bin
  - Dependent on load factor
  - `O`(1) if load factor is below a given constant
  - `O`(`n`) in worst case, but this can be avoided by choice of hash function

- The efficiency of hashing depends on the load factor
- If the number of data items increases, the hash table has to be expanded
- If the number of data items decreases, it is desirable to shrink the hash table
- Expansion/shrinking can be implemented by re-inserting the data into a new hash table (changing the divisor of the modulo operation)
- Expansion/shrinking is heavy (time: `O`(`n`))
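Re-insertion into a new table can be sketched in a few lines. The representation is an assumption made for this example: the table is an `Array` of bins, and each bin is an `Array` of `[key, value]` pairs (chaining). The function name `rehash` is illustrative.

```ruby
# Build a new table with new_bin_count bins and re-insert every pair.
# Each of the n data items is visited once, so the time is O(n).
def rehash(bins, new_bin_count)
  new_bins = Array.new(new_bin_count) { [] }
  bins.each do |bin|
    bin.each do |key, value|
      # Only the divisor of the modulo operation changes.
      new_bins[key.hash % new_bin_count] << [key, value]
    end
  end
  new_bins
end

small = [[["a", 1], ["b", 2]], [["c", 3]]]
large = rehash(small, 8)
puts large.size              # 8
puts large.flatten(1).size   # 3
```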

- If the hash table is expanded for every data insertion, this is extremely inefficient
- Limit the number of expansions:
  - Increase the size of the hash table whenever the number of data items doubles
  - The time needed for the insertion of $n$ data items ($n = 2^x$) is $2 + 4 + 8 + \dots + n/2 + n < 2n = O(n)$
  - The time complexity per data item is $O(n)/n = O(1)$

(This is a simple example of *amortized analysis*.)
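The sum can be checked with a small simulation. The cost model is an assumption chosen to match the sum above: the table starts with one bin, doubles whenever it is full, and each expansion is charged one unit per bin of the new table. The function name `doubling_cost` is illustrative.

```ruby
# Total expansion cost for inserting n data items into a doubling table.
def doubling_cost(n)
  capacity = 1
  cost = 0
  1.upto(n) do |items|
    next unless items > capacity
    capacity *= 2      # double the table ...
    cost += capacity   # ... and charge one unit per bin of the new table
  end
  cost
end

[8, 1024].each do |n|
  puts "n=#{n}: cost=#{doubling_cost(n)}, bound 2n=#{2 * n}"
end
# n=8: cost=14, bound 2n=16
# n=1024: cost=2046, bound 2n=2048
```

For comparison, growing the table by one bin per insertion would cost 1 + 2 + ... + `n` = `n`(`n`+1)/2, which is `O`(`n`²) in total.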

- Universal hashing
- Perfect hash function
- Cryptographic hash function

- Include a random number to create a different hash function for each program execution
- Solution for some denial-of-service attacks:
  - Provide lots of data with same hash value
  - Efficiency of hash degrades from `O`(1) to `O`(`n`)
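One classic way to include a random number, sketched here for non-negative integer keys, is the family `((a·k + b) mod p) mod m` with a prime `p` larger than any key and randomly chosen `a` and `b`. This is an illustration only and not a description of how Ruby seeds its own hash function. The names `MERSENNE_PRIME_61` and `make_universal_hash` are made up, and keys are assumed to be smaller than the prime.

```ruby
MERSENNE_PRIME_61 = 2**61 - 1   # prime p; keys are assumed to be in 0...p

# Return a randomly chosen hash function for a table with m bins.
def make_universal_hash(m, rng = Random.new)
  a = rng.rand(1...MERSENNE_PRIME_61)   # 1 <= a < p
  b = rng.rand(0...MERSENNE_PRIME_61)   # 0 <= b < p
  ->(k) { ((a * k + b) % MERSENNE_PRIME_61) % m }
end

hf = make_universal_hash(100)
puts hf.call(15815000)   # some bin in 0...100, different from run to run
puts hf.call(15715000)
```

Because `a` and `b` are unknown to an attacker, a set of keys prepared in advance is unlikely to fall into one bin.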

- Custom-designed hash function without conflicts
- Useful when data is completely predefined
- In the best case, the hash table is completely filled
- Application: Keywords in programming languages
- Example implementation: GNU gperf (in Japanese) (used in Ruby character property lookups)

- Used for *electronic signatures, ...*
- Differences from general hash functions:
  - Output usually longer (e.g. 128/256/384/512/... bits)
  - Practically impossible to generate same output from different input
  - Much more difficult to invert (find `k` from `hf`(`k`))
  - Evaluation may take longer
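The longer output is easy to observe with Ruby's standard `digest` library. SHA-256 is chosen here only as one widely used example of a cryptographic hash function.

```ruby
require 'digest'

digest = Digest::SHA256.hexdigest("abc")
puts digest
puts digest.length * 4   # 256 bits (64 hexadecimal digits)

# A tiny change in the input gives a completely different output.
puts Digest::SHA256.hexdigest("abd") == digest   # false
```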

Advantages:

- Search/insertion/deletion are possible in (average) **constant** time
- Reasonably good actual performance
- No need for keys to be numeric or ordered
- Wide field of application

Problems:

- Sorting needs to be done separately (Ruby `Hash`es store insertion order, but not key order)
- Proximity/similarity search is impossible
- Expansion/shrinking requires time (possible operation interrupt)

Implementation | Search | Insertion | Deletion | Sorting |
---|---|---|---|---|
Sorted array | O(log n) | O(n) | O(n) | O(n) |
Unordered array/linked list | O(n) | O(1) | O(n) | O(n log n) |
Balanced tree | O(log n) | O(log n) | O(log n) | O(n) |
Hash table | O(1) | O(1) | O(1) | O(n log n) |

`Hash` class (Perl: `hash`; Java: `HashMap`; Python: `dict`)

- Because dictionary ADTs are often implemented using hashing, in many programming languages, dictionaries are also called "hash"
- Creation: `my_hash = {}` or `my_hash = Hash.new`
- Initialization: `months = {'January' => 31, 'February' => 28, 'March' => 31,` ... `}`
- Insertion/replacement: `months['February'] = 29`
- Lookup: `this_month_length = months[this_month]`

`Hash` in Ruby has more functionality than in other programming languages (presentation)
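A few of these operations in one runnable snippet. The table is cut down to the three months listed above plus one inserted entry, and the remaining calls are standard `Hash` methods.

```ruby
months = { 'January' => 31, 'February' => 28, 'March' => 31 }
months['February'] = 29   # replacement
months['April'] = 30      # insertion

p months.keys                                   # ["January", "February", "March", "April"]
p months.sort_by { |name, _days| name }.first   # ["April", 30]
p months['May']                                 # nil (key not present)
p months.fetch('May', 0)                        # 0 (explicit default)
```

The first output shows insertion order, not key order. Sorting by key has to be done separately, here with `sort_by`.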

- Source: st.c
- Used chaining until 2016 (originally by Peter Moore, University of California Berkeley (1989))
- On Nov. 7, 2016 replaced by open addressing (by Vladimir Makarov, with help from Yura Sokolov)
- Reason: Faster because open addressing works better with the modern cache hierarchy
- Used inside Ruby, too:
  - Lookup of global identifiers such as class names
  - Lookup of methods for each class
  - Lookup of instance variables for each object

- A *hash table* implements a dictionary ADT using a *hash function*
- Main design points:
  - Selection of hash function
  - Conflict resolution methods (*chaining* or *open addressing*)
- Reasonably good actual performance
- Wide field of application

- direct addressing: 直接アドレス表
- hashing, scatter storage technique: ハッシュ法、挽き混ぜ法
- hash function: ハッシュ関数
- hash table: ハッシュ表
- game of Go: 囲碁
- joseki: 定石 (囲碁)
- universal hashing: 万能ハッシュ法
- denial of service attack: DOS 攻撃、サービス拒否攻撃
- perfect hash function: 完全ハッシュ関数
- cryptographic hash function: 暗号技術的ハッシュ関数
- electronic signatures: 電子署名
- conflict: 衝突
- Poisson distribution: ポアソン分布
- chaining: チェイン法、連鎖法
- open addressing: 開番地法、オープン法
- load factor: 占有率
- linear probing: 線形探査法
- quadratic probing: 二次関数探査法
- divisor: (割り算の) 法
- amortized analysis: 償却分析
- proximity search: 近接探索
- similarity search: 類似探索