Algorithms for String Matching

(文字列照合のアルゴリズム)

Data Structures and Algorithms

11th lecture, December 7, 2017

http://www.sw.it.aoyama.ac.jp/2017/DA/lecture11.html

Martin J. Dürst

AGU

© 2009-17 Martin J. Dürst, Aoyama Gakuin University

Today's Schedule

 

Summary of Last Lecture

 

Leftovers of Last Lecture

 

Overview of String Matching

Goal: To find a short pattern p in a long text t

Background: String matching algorithms in various programming languages

...paternpattnetrnternapatternatnetnpttepneretanetpat...

 

String Matching Context

 

Simplistic Implementation
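
As a rough illustration of the simplistic approach, the sketch below tries every start position in the text and compares the pattern character by character. The method name naive_search is an assumption made for this example and is not taken from the lecture materials. In the worst case it needs about n·m character comparisons for a text of length n and a pattern of length m.

```ruby
# Simplistic string matching: try every start position s in the text.
# Returns the start indices of all occurrences of pattern in text.
# (naive_search is a hypothetical name chosen for this sketch.)
def naive_search(text, pattern)
  n = text.length
  m = pattern.length
  matches = []
  (0..n - m).each do |s|
    j = 0
    # Advance j while the characters agree.
    j += 1 while j < m && text[s + j] == pattern[j]
    matches << s if j == m   # all m characters matched
  end
  matches
end

text = "paternpattnetrnternapatternatnetnpttepneretanetpat"
p naive_search(text, "pattern")   # prints [20]
```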

 

Overview of the Rabin-Karp Algorithm

 

Selecting the Hash Function

 

Speeding up the Hash Function

 

Example of Rabin-Karp

Pattern: 081205

Text: 28498608120598743297

b = 10, d = 9

 (for manual calculation, casting out nines is helpful)

Example implementation of the Rabin-Karp algorithm in Excel: BRabin-Karp.xls

Pseudocode of Rabin-Karp algorithm: Bstringmatch.rb
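
The example above can also be traced with a short Ruby sketch. It is not the content of Bstringmatch.rb; the method name rabin_karp and the restriction to strings of decimal digits are assumptions made for this illustration. The hash of a window is its value as a base-b number modulo d, and it is updated in constant time when the window moves one position to the right.

```ruby
# Rabin-Karp sketch for strings of decimal digits.
# base and mod default to the example values b = 10, d = 9.
# Returns the start indices of all occurrences of pattern in text.
def rabin_karp(text, pattern, base = 10, mod = 9)
  n = text.length
  m = pattern.length
  return [] if m == 0 || m > n

  digit = ->(ch) { ch.ord - '0'.ord }
  high = (base**(m - 1)) % mod   # weight of the leading digit, modulo mod

  p_hash = 0
  t_hash = 0
  m.times do |i|
    p_hash = (p_hash * base + digit.(pattern[i])) % mod
    t_hash = (t_hash * base + digit.(text[i])) % mod
  end

  matches = []
  (0..n - m).each do |s|
    # Equal hash values only suggest a match; confirm by direct comparison.
    matches << s if t_hash == p_hash && text[s, m] == pattern
    if s < n - m
      # Rolling update: drop the leading digit, append the next digit.
      t_hash = ((t_hash - digit.(text[s]) * high) * base + digit.(text[s + m])) % mod
    end
  end
  matches
end

p rabin_karp("28498608120598743297", "081205")   # prints [6]
```

With d = 9 many windows share the pattern's hash value, so the direct comparison runs often; a larger modulus would reduce such false candidates.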

 

Overview of Knuth-Morris-Pratt Algorithm

 

Details of Knuth-Morris-Pratt Algorithm

 

Time Complexity of Knuth-Morris-Pratt Algorithm

 

Precomputation for Knuth-Morris-Pratt Algorithm

For each position x in the pattern,
assume that p[x] does not match the text
but that p[0] ... p[x-1] (length x) have already matched.
The length of the longest proper prefix of this matched part that is also its suffix
gives the position in the pattern to compare next.

Pseudocode for Knuth-Morris-Pratt algorithm: Bstringmatch.rb
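
The precomputation described above might look as follows in Ruby. This is a sketch under assumptions, not the content of Bstringmatch.rb: the names kmp_table and kmp_search are hypothetical, and the table stores one entry per matched length x from 0 to the pattern length.

```ruby
# failure[x] = length of the longest proper prefix of pattern[0...x]
# that is also a suffix of pattern[0...x].
def kmp_table(pattern)
  failure = Array.new(pattern.length + 1, 0)
  k = 0   # border length of the part processed so far
  (1...pattern.length).each do |x|
    k = failure[k] while k > 0 && pattern[x] != pattern[k]
    k += 1 if pattern[x] == pattern[k]
    failure[x + 1] = k
  end
  failure
end

# Returns the start indices of all occurrences of pattern in text.
# The text position never moves backwards.
def kmp_search(text, pattern)
  return [] if pattern.empty?
  failure = kmp_table(pattern)
  matches = []
  k = 0   # number of pattern characters currently matched
  text.each_char.with_index do |ch, i|
    # On a mismatch, fall back to the next shorter matching prefix.
    k = failure[k] while k > 0 && ch != pattern[k]
    k += 1 if ch == pattern[k]
    if k == pattern.length
      matches << i - k + 1
      k = failure[k]   # continue, allowing overlapping matches
    end
  end
  matches
end

p kmp_table("abab")   # prints [0, 0, 0, 1, 2]
```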

 

Overview of Boyer-Moore Algorithm

 

Idea Details

Two guidelines for shifting the pattern:

  1. Pattern-internal comparison
    (back-to-front version of Knuth-Morris-Pratt algorithm)
  2. The rightmost position in the pattern of the text character that caused the mismatch
    text e f g h i j k a m n o
    pattern a o g a k u
    next pattern position a o g a k u

Select the larger shift
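
The second guideline can be sketched in Ruby as below. This is a deliberate simplification that uses only guideline 2 and omits the pattern-internal comparison of guideline 1, so it is not the full Boyer-Moore algorithm. The method names and the sample text are assumptions made for this illustration.

```ruby
# Rightmost index of each character in the pattern (guideline 2).
def rightmost_positions(pattern)
  table = {}
  pattern.each_char.with_index { |ch, i| table[ch] = i }
  table
end

# Back-to-front matching that shifts by the rightmost-position rule only.
# Returns the start indices of all occurrences of pattern in text.
def bm_bad_character_search(text, pattern)
  n = text.length
  m = pattern.length
  return [] if m == 0 || m > n
  rightmost = rightmost_positions(pattern)
  matches = []
  s = 0
  while s <= n - m
    j = m - 1
    # Compare from the end of the pattern towards its start.
    j -= 1 while j >= 0 && pattern[j] == text[s + j]
    if j < 0
      matches << s
      s += 1
    else
      # Align the rightmost occurrence of the mismatching text character;
      # -1 means the character does not occur in the pattern at all.
      r = rightmost.fetch(text[s + j], -1)
      s += [j - r, 1].max
    end
  end
  matches
end

p bm_bad_character_search("aoyamaaogakudaigaku", "aogaku")   # prints [6]
```

When the mismatching text character does not occur in the pattern, the pattern moves past it entirely, which is why long shifts are common for large alphabets.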

 

Time Complexity of Boyer-Moore Algorithm

 

String Matching and Character Encoding

 

Character Encodings and Byte Patterns

 

Outlook

 

Summary

Homework:

 

Glossary

string matching
文字列照合
pattern
パターン
text
文書
substring
部分文字列
simplistic
素朴な、単純すぎる
nucleotide
ヌクレオチド
amino acid
アミノ酸
precomputation
事前計算
casting out nines
九去法
prefix
接頭部分列
suffix
末尾部分列
character encoding
文字コード
lexical analysis
字句解析
finite automaton (pl. automata)
有限オートマトン
regular expression
正規表現